ABSTRACT

The equating of reasonably parallel forms of College 
Board Achievement Tests in biology, chemistry, mathematics level II, 
American history and social studies, and French is discussed. Results 
of the following five equating methods are compared: (1) Tucker; (2) 
Levine equally reliable; (3) Levine unequally reliable; (4) frequency 
estimation equipercentile; and (5) chained equipercentile. These 
methods are used with an internal common-item anchor-test data 
collection design and three sampling strategies (random samples from 
populations similar in ability level, random samples from populations 
of dissimilar ability, and samples from dissimilar populations 
constructed to be similar by matching on the basis of a covariate 
such as the distribution of scores on a set of common items). Results 
indicate that it may be difficult, and in some cases impossible, to 
equate achievement tests using new-form and old-form samples from 
populations differing in ability level. All these equating methods 
appear to be affected by group differences in ability, with the 
Tucker and frequency estimation equipercentile methods the most 
affected, and the chained equipercentile and the two Levine 
procedures the most robust. Matching cannot be recommended for 
rectifying sample ability-level differences. There are 17 tables of 
study data, 33 figures, and a list of 16 references. (Author/SLD) 
ABSTRACT

The equating of reasonably parallel forms of College Board 
Achievement Tests in Biology, Chemistry, Mathematics Level II, 
American History and Social Studies, and French is discussed in 
this paper. The results of five equating methods are compared: (1) 
Tucker, (2) Levine equally reliable, (3) Levine unequally reliable, 
(4) frequency estimation equipercentile, and (5) chained equipercentile.
These methods are used with an internal common-item 
anchor-test data collection design. Three sampling strategies were 
evaluated: (1) random samples from populations similar in ability 
level, (2) random samples from populations dissimilar in ability 
level, and (3) samples from populations dissimilar in ability level 
that have been constructed to be similar in ability level by matching
on the basis of a covariate, such as the distribution of scores on 
a set of common items. The criteria for comparison in all cases 
were the results of the Tucker procedure used with random samples 
from populations similar in ability level. These results were used 
as the criterion for equating results because they represent results 
obtained under optimal operational conditions. 

The results of the study indicate that it may be difficult, and 
in some cases impossible, to equate achievement tests using new- 
and old-form samples obtained from populations that are different 
in ability level. All equating methods investigated in this study 
appear to be affected by group differences in ability. The equating 
methods that appear to be the most affected by these differences 
are the Tucker and frequency estimation equipercentile procedures.
The methods that appear to be the most robust to group 
differences in ability are the chained equipercentile and the two 
Levine procedures. 

Matching on the basis of observed scores on a set of internal 
common items does not remedy the situation. In general, matching 
produces results, particularly scaled-score means, for all equating 
procedures that are similar but that over- or underestimate the criterion
scaled-score means. Because the results (i.e., scaled-score 
means) are similar across methods, the effect can be quite misleading
in that, in the absence of a criterion, one could conclude that 
because consistent results are obtained across methods the results 
are close to "truth." This was not found to be the case for the situations
investigated in this study, and matching cannot be recommended
as a procedure for rectifying the problem of sample 
ability-level differences. 



INTRODUCTION 

Practitioners working on large-scale admissions testing pro- 
grams are typically faced with the situation of having to 
equate test forms taken by groups that differ systematically 
in ability. In this situation, the equating is usually performed 
using an anchor-test design, where the anchor can be a set 
of common items embedded in the forms to be equated. An 
additional concern of individuals engaged in the equating of 
achievement tests, given at multiple administrations that 
span the school year, is that such tests have been specifically 
designed to reflect course content. This may affect students 
taking the tests at different points in their coursework in different
ways. These students may not only differ in ability, 
but they may also not constitute samples from the same population.
For instance, students who elect to take a test at a 
spring administration are usually finishing a course of instruction
in the content covered by the test. Students who 
typically elect to take the test at a fall administration may 
not have had formal coursework for some time. 

Underlying any commonly accepted definition of 
equated scores is the requirement that the equating transformation
be sample independent; that is, the raw-score to 
scale-score conversion function should be the same, regard- 
less of the sample or samples from the population from 
which it is derived (see Angoff 1984 or Holland and Rubin 
1982). In the context of equating achievement tests, this requirement
would suggest, for instance, that the equating 
transformation for a new test form should be the same, regardless
of the time of year the sample data used in the 
equating experiment were collected. 

Achievement Tests administered for the College 
Board's Admissions Testing Program (ATP) have historically
been equated using an anchor-test design, using a set 
of internal common items as the anchor test (see Angoff 
1984 for a description of this type of design). New forms of 
the Achievement Tests are administered and equated using 
new- and old-form data collected at an administration oc- 
curring in the same month but in different years. Typically, 
the old-form sample is selected from an administration that 
occurred perhaps two or three years prior to the administra- 
tion of the new form. In spite of the time lag between the 
administrations of the new and old forms, the two samples 
are usually similar in composition, ability level, and preparation
or coursework relevant to the particular achievement 
test. Recently, it has become necessary, because of an increase
in the volume of new and revised forms requiring 
equating, to consider introducing new forms at administra- 
tions where comparable old-form data may not exist. This 
raises a question regarding the adequacy of an equating 
transformation that is obtained, for instance, when a new- 
form sample from a spring administration of a test is used 
along with an old-form sample from a fall administration. 
There is existing evidence that this situation may seriously 
affect equating results. 

Cook, Eignor, and Taft (1988) examined the results of 
equating two forms of the Achievement Test in Biology. In 
their study, one old-form sample and two different new- 
form samples provided the data for equating. The old-form 
sample was randomly selected from a fall administration of 
the test. One of the new-form samples was randomly se- 
lected from a spring administration of the test; the second 
sample was randomly selected from a fall administration. 
Groups of students electing to take the Biology Test at the 
fall and the spring administrations vary greatly in the abili- 
ties measured by the test. Students taking the test in the 
spring typically have finished or are about to finish a course 
in the content covered by the test, whereas those taking the 
test in the fall may have taken the relevant biology course 6 
to 18 months prior to the test administration. Cook et al. 
equated the two forms of the Biology Test using conventional
linear procedures (Tucker or Levine equating models, 
depending on ability-level differences; see Angoff 1984) 
and chained equipercentile (Design V in Angoff 1984) and 
three-parameter logistic (3-PL) model Item Response 
Theory (IRT) curvilinear equating procedures (see Lord 
1980). All equatings were first carried out using the spring 
new-form and fall old-form pairing and then using the fall 
new-form and fall old-form pairing. Equating results were 
quite different for these two sample combinations. The 
equatings based on the combination of the spring new-form 
and fall old-form samples resulted in scaled-score means at 
least 15 points higher than those based on the combination 
of the fall new-form and fall old-form samples. There was 
also a good deal of variation across the equating results for 
the various methods performed using data from the spring 
new-form and fall old-form samples. The results of the 
Cook et al. study clearly demonstrated the need to define 
the population of students for whom the equating transformation
is applicable. 

Recently, in an attempt to ameliorate differences in 
anchor-test equating results caused by samples that differ in 
ability, researchers have begun to study the effects on equating
procedures of matching one of the samples to the other, 
usually through scores on either a set of internal common 
items or an external anchor test. Lawrence and Dorans 
(1988) studied the effects of matching, using anchor-test 
scores, on Scholastic Aptitude Test (SAT) verbal and math- 
ematical equatings. The results of the Lawrence and Dorans 
study suggested that matching the differing ability samples 
in the Cook, Eignor, and Taft (1988) study (i.e., the spring 
new-form and fall old-form samples) on an available covariate
prior to equating might diminish the differences in 
equating results seen in that study. In addition, the equatings 
of the combination of the fall new-form and fall old-form 
samples from the Cook, Eignor, and Taft study provided a 
useful evaluative criterion for any matching done with the 
spring new-form and fall old-form combination in that the 
combination of the two fall samples provided equating 
results that were obtained under what has traditionally 
been considered to be the most satisfactory operational 
conditions. 

With the above in mind, Cook, Eignor, and Schmitt 
(1988) examined the effects on IRT and conventional 
(Tucker, Levine unequally reliable, and chained equipercentile)
equating results of matching fall-spring new- and old-form
Biology samples, using two different covariate measures:
(1) observed scores on the internal common-item 
equating block, and (2) responses to selected questions on 
the Student Descriptive Questionnaire (SDQ), a question- 
naire that examinees respond to on a voluntary basis when 
filling out their registration forms for the test. An older ver- 
sion of the SDQ was used in the study; it has since been 
revised. The criteria used in the Cook, Eignor, and Schmitt 
study were the Tucker equatings based on the groups taking 
the new and old forms in the fall of the year. 

The results of the Cook, Eignor, and Schmitt (1988) 
study indicated that (1) matching on a set of common items 
provided greater agreement among the results of the various 
equating procedures studied, (2) for all equating proce- 
dures, the results of the common-items matched group 
equating agreed more closely with the criterion equating 
than did the results of the unmatched fall and spring equat- 
ings, and (3) matching on the selected questions from the 
SDQ did not improve results over the fall-spring random 
group equatings, possibly because a large number of exam- 
inees did not respond to the questions. The results of match- 
ing on scores on the common items provided an optimistic 
view about the possibility of being able to introduce new 
forms of the Biology Test at administrations where compa- 
rable old-form data did not exist. 

The purposes of the present study were twofold: (1) to 
determine whether the improved fall-spring equating results 
brought about in the Cook, Eignor, and Schmitt study by 
matching on scores on common items generalize to ATP 
Achievement Tests in other content areas besides Biology, 
and (2) to continue to investigate the possibility of using 
covariates other than scores on an internal set of common 
items for matching purposes. Achievement Tests in Mathe- 
matics Level II, American History and Social Studies, 
Chemistry, and French were analyzed in the present study. 
Results from the previous study of Biology are included in 
this paper for comparative purposes. However, in the present
study, IRT curvilinear true-score equating has been replaced
by another curvilinear equating technique based on 
observed scores, frequency estimation equipercentile equat- 
ing (see Angoff 1984, p. 113). Finally, for the French Test 
only, matched sample equatings were also performed using 
responses to selected questions on the SDQ (the old, not the 
revised version), to see whether the inadequate results noted 
for the Biology Test were replicated with these data. In ad- 
dition, for the French Test only, responses to a question on 
amount of coursework contained on a Background Ques- 
tionnaire (BQ) that appears with that test were used to 
match samples on ability. The French Test was the only test 
in the set being studied for which suitable (for matching) 
BQ and SDQ response data had been collected. 



METHODOLOGY 

Tests, Test Forms, and Samples 

All tests in this study are formula scored with a correction 
for random guessing. The Biology, Mathematics Level II, 
American History and Social Studies, and Chemistry Tests 
all contain five-option multiple-choice questions, whereas 
the French Test contains four-option multiple-choice ques- 
tions. Table 1 contains the numbers of items in the new and 
old forms of each of the five tests in the study, along with 
the numbers of items in the common-item sets. It also contains
information on when the new and old forms were administered.
Formula scores on each of the Achievement 
Tests under study are transformed to scaled scores on a 
200-800 scale, typically using conventional linear and 
equipercentile equating methods. A separate 200-800 scale 
exists for each test, and this scale is used for score reporting 
purposes. 
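
As an illustration of formula scoring, the sketch below applies the standard rights-minus-a-fraction-of-wrongs correction for random guessing. The report does not print the formula, so the expression R - W/(k - 1), the function name, and the example counts are assumptions introduced here for illustration only.

```python
def formula_score(num_right: int, num_wrong: int, num_options: int) -> float:
    """Formula score R - W/(k - 1); omitted items neither add nor subtract."""
    if num_options < 2:
        raise ValueError("an item needs at least two options")
    return num_right - num_wrong / (num_options - 1)


# Hypothetical examinees: one on a five-option test, one on a four-option test.
five_option = formula_score(num_right=60, num_wrong=20, num_options=5)
four_option = formula_score(num_right=60, num_wrong=21, num_options=4)
```

Under this correction each wrong answer costs one-fourth of a point on the five-option tests and one-third of a point on the four-option French Test.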

Following is a description of the samples used in the 
various equatings for each test. An outline of the content 
and skills tested by each of the five Achievement Tests used 
in this study is found in Table 2. Each of the test forms for 
each of the tests was developed in accordance with these 
content specifications. 

Biology 

Random samples of approximately 2,500 examinees were 
selected from the total groups taking the new form at the fall 
administration and the old form at the spring and fall admin- 
istrations. In addition, two matched samples from the 
spring old-form group were selected. One of these samples 
was selected in a nonrandom fashion to match the distribu- 
tion of scores on the common items of the fall new-form 
sample. The other old-form matched sample was selected to 
match fall new-form group-cell frequencies in a bivariate 
cross-tabulation of responses to two questions from the 
SDQ, one question having to do with self-reported grades 
in biological sciences and the other with self-reported scien- 
tific ability (in comparison with others in the same age 
group). In forming both old-form matched samples, stu- 
dents were selected from the total spring old-form group 
(N = 23,369). 

Figure 1 presents frequency distributions for the 36 
common items administered to the fall new-form sample 
and the spring old-form total group. Figure 1 also contains 
the frequency distribution for the spring old-form matched 
group, where the matching variable was the score on common
items. As can be observed in the column of percentages
below, the spring old-form total group had a higher 
proportion of scores at the top of the 36-item range (approximately
50 percent scoring below 23). The fall new-form 
scores are distributed across a wider range of the common- 
item scores with approximately 50 percent scoring below 
18. Because of the size of the spring old-form total group 
(N = 23,369), it was possible to create a spring old-form 
matched group that matched exactly the frequency distribu- 
tion of the fall new-form group. 
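
The exact matching described above can be sketched as a stratified draw: for every common-item score, as many examinees are drawn from the large pool as the target sample has at that score. This is a minimal sketch, not the procedure actually programmed for the study; the function name `match_exact`, the (id, score) record layout, and the toy data are hypothetical.

```python
import random
from collections import Counter, defaultdict


def match_exact(target_scores, pool, rng):
    """Draw from pool, a list of (examinee_id, anchor_score) pairs, a sample
    whose anchor-score frequencies equal those of target_scores."""
    needed = Counter(target_scores)
    by_score = defaultdict(list)
    for examinee_id, score in pool:
        by_score[score].append(examinee_id)
    matched = []
    for score, count in sorted(needed.items()):
        candidates = by_score[score]
        if len(candidates) < count:
            raise ValueError(f"pool has too few examinees at score {score}")
        # Random selection within a score level; nonrandom across levels.
        matched.extend((i, score) for i in rng.sample(candidates, count))
    return matched


# Toy data: a six-person target sample and a twelve-person pool.
rng = random.Random(0)
target = [3, 3, 4, 5, 5, 5]
pool = list(enumerate([3, 3, 3, 4, 4, 5, 5, 5, 5, 6, 6, 6]))
sample = match_exact(target, pool, rng)
```

The draw fails as soon as the pool is short at any score level, which is why exact matching was feasible only with a pool as large as the Biology spring old-form total group.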

Figures 2 and 3 present SDQ question 15 (self-reported 
grades in biological sciences) and question 58 (self-reported 
scientific ability), respectively. The manner in which responses
were recoded for cross-tabulation purposes and the 
correlations between responses on these questions and Biology
Achievement Test scores for the fall new-form and 
spring old-form groups are reported in each of these figures. 



The correlations for each of these two questions are in the 
low to mid-.40s, with slightly higher correlations for the 
spring old-form group. 

Cross-tabulations of the recoded responses to SDQ 
questions 15 and 58 are displayed in Figure 4. The fre- 
quency in each cell of the cross-tabulation for the spring 
old-form group was matched to the corresponding cell frequency
in the fall new-form group. Again, because of the 
size of the spring old-form total group, it was possible to 
create a spring old-form matched group that matched ex- 
actly, cell by cell, the bivariate cross-tabulation for the fall 
new-form group. Approximately half of the fall new-form 
group and of the spring old-form group (44 percent and 49 
percent, respectively) failed to respond to either one of the 
two questions. 

Formula-score summary statistics for the total tests and 
the 36-item common-item set for the five individual samples 
are presented in Table 3. The mean of the common-item set 
is represented as a percentage of the maximum possible 
score to provide an indication of the ability level of the five 
samples that is not dependent on the number of common 
items. The fall new- and old-form random groups and the 
matched common-items spring old-form group are closely 
matched in ability. The spring old-form random group had a 
considerably higher mean score on the common-item set, 
and the mean for the matched SDQ spring old-form group 
varied only slightly from the mean for this group. 

Mathematics Level II 

Random samples of approximately 2,000-2,300 examinees 
were selected from the total groups taking the new form at 
the fall administration and the old form at the spring and fall 
administrations. In addition, a matched sample from the 
spring old-form group was created. This sample was selected
in a nonrandom fashion to match the distribution of 
scores on the common items of the fall new-form sample. 
In forming this old-form matched sample, students were se- 
lected from the total spring old-form group available in the 
data base (N = 2,009). Because of the size of the spring old-form
total group, a spring old-form matched group could 
not be created to match exactly the frequency distribution of 
scores on the common items of the fall new-form group. 
The matched old-form group was created by proportional 
sampling in a manner such that the percentage below at each 
common-item score for the spring old-form matched group 
matched as closely as possible the corresponding percentage
below at each score for the fall new-form group. 
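
One way to read the proportional sampling described above is sketched below: the target relative frequencies are kept, but the matched sample is shrunk until the quota at every common-item score fits within the available pool. The report does not give the exact rule used, so the scaling rule, the name `match_proportional`, and the toy data are assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict
from fractions import Fraction


def match_proportional(target_scores, pool, rng):
    """Draw from pool, a list of (examinee_id, anchor_score) pairs, the largest
    sample whose relative anchor-score frequencies follow target_scores."""
    target = Counter(target_scores)
    n_target = len(target_scores)
    by_score = defaultdict(list)
    for examinee_id, score in pool:
        by_score[score].append(examinee_id)
    # Largest matched-sample size for which every score quota fits in the pool.
    scale = min(Fraction(len(by_score[s]) * n_target, c) for s, c in target.items())
    matched = []
    for score, count in sorted(target.items()):
        quota = int(scale * count / n_target)  # rounded down, so never above the pool
        matched.extend((i, score) for i in rng.sample(by_score[score], quota))
    return matched


# Toy data: target proportions .2/.6/.2; the pool is short at score 2.
rng = random.Random(0)
target = [1] * 2 + [2] * 6 + [3] * 2
pool = list(enumerate([1] * 4 + [2] * 3 + [3] * 5))
sample = match_proportional(target, pool, rng)
```

In the toy data the pool holds only three examinees at score 2, so the matched sample shrinks to five examinees while keeping the .2/.6/.2 proportions. Because quotas are rounded down, the match to the target percentages is generally close rather than exact.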

Figure 5 presents frequency distributions for the 19 
common items administered to the fall new-form group and 
the spring old-form total group. As can be observed in the 
column of percentages below, a higher proportion of the 
spring old-form total group scored at the top of the 19-item 
range (approximately 50 percent scoring at or above 10). 

Formula-score summary statistics for the two total 
tests and the 19-item common-item set for the four individual
samples are presented in Table 4. The fall new-form and 
old-form random groups and the matched common-items 
spring old-form group were closely matched in ability, as 
measured by the common items, while the spring old-form 
random group obtained a higher score on the common-item
set. 

American History and Social Studies 

Random samples of approximately 2,000-2,100 examinees 
were selected from the total groups taking the new form at 
the fall administration and the old form at the spring and fall 
administrations. In addition, a matched sample from the 
spring old-form group was created. This sample was selected
in a nonrandom fashion to match the distribution of 
scores on the common items of the fall new-form sample. 
In forming this old-form matched sample, students were se- 
lected from the total spring old-form group available in the 
data base (N = 2,031). As with Mathematics Level II, the size 
of this spring old-form total group precluded an exact 
matching of frequency distributions on the common items, 
and hence, proportional sampling (matching relative fre- 
quencies as closely as possible) was used to create the 
spring old-form matched group. 

Figure 6 presents frequency distributions for the 20 
common items administered to the fall new-form sample 
and the spring old-form total group. As can be observed in 
the column of percentages below, a higher proportion of the 
spring old-form group scored at the top of the 20-item range 
(approximately 50 percent scoring at or above 11). 

Formula-score summary statistics for the total tests and 
the 20-item common-item set for the four individual 
samples are presented in Table 5. The fall new-form and 
old-form random groups and the matched common-items 
spring old-form group were closely matched in ability, 
as measured by the common items, while the spring old- 
form random group obtained a higher score on the common- 
item set. 

Chemistry 

Random samples of approximately 2,000-2,200 examinees 
were selected from the total groups taking the new form at 
the fall administration and the old form at the spring and fall 
administrations. As with the other tests, a matched sample 
from the spring old-form group was again created. This 
sample was selected in a nonrandom fashion to match the 
distribution of scores on the common items of the fall new- 
form sample. In forming this old-form matched sample, 
students were selected from the total spring old-form group 
available in the data base (N = 2,206). As was the case for 
the last two tests discussed, proportional sampling was 
again used in an attempt to match relative frequency distri- 
butions on the common items, resulting in the spring old- 
form matched group. 

Figure 7 presents frequency distributions for the 20 
common items administered to the fall new-form sample 
and the spring old-form total group. As can be observed in 
the column of percentages below, a higher proportion of the 
spring old-form group scored at the top of the 20-item range 
(approximately 55 percent scoring at or above 11). 

Formula-score summary statistics for the two tests and 
the 20-item common-item set for the four individual 
samples are presented in Table 6. The fall new-form and 
old-form random groups and the matched common-item 
spring old-form group were closely matched, whereas the 
spring old-form random group obtained a higher score on 
the common-item set. 

French 

Random samples of approximately 6,100 examinees were 
selected from the total groups taking the new form at the fall 
administration and the old form at the spring and fall admin- 
istrations. In addition, three matched samples from the 
spring old-form group were selected. One of these samples 
was selected, as was the case for the tests previously dis- 
cussed, in a nonrandom fashion to match the distribution of 
scores on the common items of the fall new-form sample 
group. The second old-form matched sample was selected 
to match fall new-form group-cell frequencies in a bivariate 
cross-tabulation of responses to two questions from the 
SDQ, one having to do with years of study in foreign lan- 
guages and the other with self-reported grades in foreign 
languages. The third old-form sample was selected to match 
fall new-form group frequencies of response to nine cate- 
gories of a question on the BQ, concerning the manner in 
which candidates obtained their knowledge of French. In 
forming these old-form matched samples, students were se- 
lected from the total spring old-form group available in the 
data base (N = 6,086). As with previous examinations, proportional
sampling (matching of relative frequencies) was 
used to create each of the spring old-form matched groups. 

Figure 8 presents frequency distributions for the 21 
common items administered to the fall new-form sample 
and the spring old-form total group. As can be observed in 
the column of percentages below, a higher proportion of the 
spring old-form group scored at the top of the 21-item range 
(over 50 percent at or above 9). 

Figures 9 and 10 present SDQ question 8 (years of 
study of foreign languages) and question 14 (self-reported 
grades in foreign languages), respectively. The manner in 
which responses were recoded for cross-tabulation purposes 
and the correlations between responses on these questions
and French Achievement Test scores for the fall new-form
and spring old-form groups are reported in each of 
these figures. The correlations of Achievement Test scores 
with each of these two questions are in the range from .27 
to .38, with the spring old-form group obtaining the higher 
correlations (.32 and .38). 

Cross-tabulations of the recoded responses to SDQ 
questions 8 and 14 are displayed in Figure 11. The fre- 
quency in each cell of the cross-tabulation for the spring 
old-form total group was matched to the corresponding cell 
frequency in the fall new-form group. Exact matching of 
cell frequencies was possible for all but one cell of the bivariate
cross-tabulations. Approximately 52 percent of the 
fall new-form group and 30 percent of the spring old-form 
group failed to respond to either one of the questions. 

Figure 12 presents the background question having to do with 
how candidates obtained their knowledge of French. Figure 
13 presents the distributions of responses to the BQ questions
used in matching the spring old-form group to the fall 
new-form group. Proportional sampling was again used to 
form the spring old-form matched group. 

Formula-score summary statistics for the two tests and 
the 21-item common-item set for the six individual samples 
are presented in Table 7. The fall new-form and old-form 
random groups and the matched common-items spring old-form
group were closely matched, whereas the spring old-form
random group and the matched SDQ and matched BQ spring 
old-form groups scored higher on the common items. 
The matched SDQ group varied only slightly from the 
spring old-form random group, whereas the matched BQ 
group was the most able of all samples. 

Conventional Item Statistics (Equated Deltas) 

The measure of item difficulty used in this study is the delta 
index, which is a transformation of the percentage or pro- 
portion of the group who answered the item correctly (see 
Hecht and Swineford 1981 or Henrysson 1981). Since deltas
are direct transformations of proportion correct, the observed
statistics (observed deltas) are dependent on the ability
level of the group responding to the items. The solution 
to this dependency problem is to use item difficulties that 
are equated to some common scale (equated deltas). 

Equating item difficulties requires establishing a base 
scale on an appropriate group. The relationship between observed
deltas and deltas on the base scale can be expressed 
as an equating transformation of the form
$\Delta_e = A(\Delta_o) + B$, where $\Delta_e$ is the equated delta value on the 
base scale, $\Delta_o$ is the observed delta value to be transformed,
and A and B are parameters estimated from the observed
(for new-form sample) and equated (for old-form 
sample) deltas for the common items. The resulting transformation
is then applied to all items in the test. Once the 
base scale is established, and once all observed deltas are 
equated to it, equated deltas can then be used to compare 
item difficulties that are not dependent on the different abil- 
ity levels of the varying groups taking the items. 
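
The delta computations can be illustrated with a short sketch. It assumes the usual definition of the delta index, 13 + 4z, with z the normal deviate that cuts off the proportion correct in the upper tail, so harder items receive larger deltas. It also assumes that A and B are set by matching the mean and standard deviation of the common-item deltas. The report does not say how A and B were estimated, so that choice, the function names, and the example values are hypothetical.

```python
from statistics import NormalDist, mean, pstdev


def delta_index(p_correct: float) -> float:
    """Delta index 13 + 4z, where z cuts off p_correct in the upper normal tail."""
    return 13.0 - 4.0 * NormalDist().inv_cdf(p_correct)


def fit_delta_equating(observed, equated):
    """Return (A, B) such that A * observed + B has the mean and standard
    deviation of the equated common-item deltas (an assumed estimation rule)."""
    A = pstdev(equated) / pstdev(observed)
    B = mean(equated) - A * mean(observed)
    return A, B


# Hypothetical common-item deltas: observed in the new-form sample and
# already on the base scale from the old-form sample.
observed_common = [10.0, 12.0, 14.0]
equated_common = [11.0, 14.0, 17.0]
A, B = fit_delta_equating(observed_common, equated_common)

# The fitted line is then applied to every item in the new form.
new_form_equated = [A * d + B for d in [9.0, 12.0, 15.5]]
middle_item = delta_index(0.5)  # an item answered correctly by half the group
```

With these illustrative values the fitted transformation is 1.5 times the observed delta minus 4, and an item answered correctly by half the group has a delta of 13.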

Of particular interest in this study are the relationships, 
for each test, between the equated deltas for the common 
items in the fall new-form random group and each of the 
various old-form groups formed for each of the tests. 

Score Equating Methods 

The five score-equating methods that were used in this study 
are (1) Tucker, (2) Levine equally reliable linear, (3) Levine 
unequally reliable linear, (4) chained equipercentile, and (5) 



frequency estimation equipercentile curvilinear methods. 
Each of these methods defines equated scores in a slightly 
different fashion. 

Linear equating procedures used in this study employed
the Tucker and the Levine equally and unequally reliable
models (Angoff 1984, pp. 109-115). For these models,
scores on the common items are used to estimate 
performance of the combined group of examinees on both 
the old and new forms of the test, thus simulating by statis- 
tical methods the situation in which the same group of ex- 
aminees takes both forms of the test. 
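
The Tucker model can be sketched as follows, using the textbook synthetic-population form of the method rather than anything printed in this report. Here X is the new form taken by group 1, Y is the old form taken by group 2, and V is the common-item score. The function name, the default weighting by sample size, and the toy data are assumptions made for illustration.

```python
import numpy as np


def tucker_linear(x1, v1, y2, v2, w1=None):
    """Slope and intercept of the Tucker line that places new-form scores X
    on the old-form scale Y, using anchor scores V from both groups."""
    x1, v1, y2, v2 = (np.asarray(a, dtype=float) for a in (x1, v1, y2, v2))
    if w1 is None:
        w1 = len(x1) / (len(x1) + len(y2))  # weight groups by sample size
    w2 = 1.0 - w1
    # Regression slopes of total score on anchor score within each group.
    g1 = np.cov(x1, v1, bias=True)[0, 1] / v1.var()
    g2 = np.cov(y2, v2, bias=True)[0, 1] / v2.var()
    dmu = v1.mean() - v2.mean()
    dvar = v1.var() - v2.var()
    # Estimated moments of X and Y in the combined (synthetic) group.
    mu_x = x1.mean() - w2 * g1 * dmu
    mu_y = y2.mean() + w1 * g2 * dmu
    var_x = x1.var() - w2 * g1**2 * dvar + w1 * w2 * g1**2 * dmu**2
    var_y = y2.var() + w1 * g2**2 * dvar + w1 * w2 * g2**2 * dmu**2
    slope = np.sqrt(var_y / var_x)
    return slope, mu_y - slope * mu_x


# Toy data: equally able groups, new form uniformly 5 points harder.
slope, intercept = tucker_linear(
    x1=[10, 20, 30, 40], v1=[1, 2, 3, 4],
    y2=[15, 25, 35, 45], v2=[1, 2, 3, 4],
)
```

When the two groups have identical anchor-score distributions, as in the toy data, the adjustment terms vanish and the method reduces to ordinary linear equating of the observed means and standard deviations. The adjustments matter only when the groups differ on the common items.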

Chained equipercentile curvilinear equating estab- 
lishes equivalency for scores on each total test and asso- 
ciated common-item set first by equipercentile methods 
separately within each group (Design V in Angoff 1984, p. 
116). Scores on the two test forms are then said to be equivalent
if they correspond to the same score on the common-item
set. 
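
The chaining logic can be sketched in a few lines. The sketch links the new form to the common-item set within the new-form group, then links the common-item set to the old form within the old-form group. It uses mid-percentile ranks with linear interpolation and no smoothing; these choices, the function names, and the toy data are simplifying assumptions and are not the operational procedure.

```python
import numpy as np


def percentile_rank_table(scores):
    """Distinct score values and their mid-percentile ranks (as proportions)."""
    values, counts = np.unique(np.asarray(scores, dtype=float), return_counts=True)
    cumulative = np.cumsum(counts)
    mid_ranks = (cumulative - counts / 2.0) / cumulative[-1]
    return values, mid_ranks


def equipercentile(from_scores, to_scores, value):
    """Score on the 'to' scale with the same percentile rank as value."""
    from_values, from_ranks = percentile_rank_table(from_scores)
    to_values, to_ranks = percentile_rank_table(to_scores)
    rank = np.interp(value, from_values, from_ranks)
    return float(np.interp(rank, to_ranks, to_values))


def chained_equipercentile(x_new, v_new, y_old, v_old, x):
    v = equipercentile(x_new, v_new, x)     # new form -> anchor, new-form group
    return equipercentile(v_old, y_old, v)  # anchor -> old form, old-form group


# Toy data: both groups have the same anchor scores; each old-form score
# is the corresponding new-form score plus 5.
anchor = np.arange(10)
equated = chained_equipercentile(10 * anchor, anchor, 10 * anchor + 5, anchor, x=40)
```

In the toy data a new-form score of 40 is carried to an anchor score of 4 and from there to an old-form score of 45. Scores outside the observed range are clamped to the end points by `np.interp`, a limitation of this sketch.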

Smoothing of equipercentile results can take place at 
two different points. Practitioners can either smooth the frequency
distributions to be used in the equating or smooth 
the results obtained from the equating (see Fairbank 1987). 
In this study, a univariate smoothing technique, called 
"Tukey-Cureton seven-point smoothing" (Cureton and 
Tukey 1951; see also Angoff 1984, p. 12) was applied to 
the frequency distributions to be used in the equipercentile 
equatings carried out through an anchor test. 

Frequency estimation equipercentile curvilinear 
equating follows the logic of a single-group equating in that 
it equates new and old forms of a test on the basis of the 
total test-score distributions of the two forms in a single 
group of examinees. Usually, this group is the combined 
group formed from the groups taking the two tests to be 
equated. The total test-score distributions are estimated distributions,
however, rather than observed distributions. 
These distributions are estimated using information contained
in the cross-tabulations of common-item scores and 
total test scores. (See Angoff 1984, p. 113, for a further 
discussion of this procedure.) 
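
The estimation step can be sketched as follows. The sketch assumes the usual frequency estimation premise that the distribution of total scores given the common-item score is the same in both groups. The total-score distribution in the combined group is then a weighted mix of the observed distribution in the group that took the form and the distribution implied for the other group by its common-item scores. The function name and the small cross-tabulation are hypothetical, and no smoothing is applied.

```python
import numpy as np


def synthetic_marginal(joint_own, anchor_other, w_own):
    """Estimated total-score distribution in the combined group.

    joint_own[i, j]: proportion of the group that took the form with total
    score i and anchor score j. anchor_other[j]: anchor-score proportions in
    the group that did not take the form. w_own: weight of the first group.
    """
    joint_own = np.asarray(joint_own, dtype=float)
    anchor_other = np.asarray(anchor_other, dtype=float)
    anchor_own = joint_own.sum(axis=0)
    with np.errstate(divide="ignore", invalid="ignore"):
        # Distribution of total score given anchor score, column by column.
        conditional = np.where(anchor_own > 0, joint_own / anchor_own, 0.0)
    own_marginal = joint_own.sum(axis=1)
    other_marginal = conditional @ anchor_other
    return w_own * own_marginal + (1.0 - w_own) * other_marginal


# Hypothetical 2 x 2 cross-tabulation (total score by anchor score).
joint = [[0.2, 0.1],
         [0.1, 0.6]]
same_ability = synthetic_marginal(joint, anchor_other=[0.3, 0.7], w_own=0.5)
less_able = synthetic_marginal(joint, anchor_other=[0.5, 0.5], w_own=0.5)
```

When the other group has the same anchor-score distribution, the estimate reproduces the observed marginal distribution. A less able other group shifts estimated frequency toward the lower total score. Equipercentile equating is then carried out on the two estimated distributions, one for each form.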

In the frequency estimation equatings in this study, 
smoothing techniques were used at two points. A bivariate 
smoothing technique, attributable to Rosenbaum and 
Thayer (1987) and referred to as Model 4 in Rosenbaum and 
Thayer (1986), was used to smooth the joint distributions of 
total test scores and anchor-test scores in the two separate 
groups. The "Tukey-Cureton seven-point smoothing" tech- 
nique was later applied to the univariate estimated total-test 
frequency distributions prior to equating. 

Criterion Measures 

Criterion measures for item analyses and score equating results
are described in this section. For each test being studied,
item analysis results were evaluated by comparing correlation
coefficients and plots of difficulty indices (equated deltas)
for the common items for the different sampling combinations
(fall-fall random groups, fall-spring random groups, fall-spring
matched groups common items, fall-spring matched groups SDQ
[Biology and French only], and fall-spring matched groups BQ
[French only]). The fall-fall random groups' correlation
coefficient and delta plot results were used as criteria to
evaluate the results for the other sample combinations. The
fall-fall random sample combination was chosen as the sample
used to define the criteria because it provides equating results
obtained under what has traditionally been considered to be the
most satisfactory operational conditions.
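
For readers unfamiliar with the delta index, the comparison described above can be sketched as follows. The sketch assumes the conventional definition of delta as 13 minus 4 times the normal deviate corresponding to the proportion of examinees answering the item correctly, so that harder items receive larger deltas. It omits the step that places deltas from different groups on a common (equated) scale, and the function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def delta_index(p_correct):
    """Delta difficulty index: 13 - 4 * z, where z is the standard normal
    deviate for the proportion correct (0 < p_correct < 1)."""
    return 13.0 - 4.0 * NormalDist().inv_cdf(p_correct)

def pearson(xs, ys):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def delta_correlation(p_new_group, p_old_group):
    """Correlation of common-item deltas computed in two samples."""
    d_new = [delta_index(p) for p in p_new_group]
    d_old = [delta_index(p) for p in p_old_group]
    return pearson(d_new, d_old)
```

A correlation near 1 indicates that the common items keep the same relative difficulty in both samples, which is the pattern the fall-fall criterion is expected to show.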

Based on the results of the delta equatings for each 
new-form and old-form sample combination for each test, 
the new-form total-test equated delta mean (an indicator of 
total-test difficulty) and standard deviation were calculated.
If the individual deltas are stable across new-form and old- 
form sample combinations, then the estimated total-test 
equated delta means and standard deviations should be sim- 
ilar. The fall-fall random groups estimated total-test equated 
delta mean and standard deviation were used for each test as 
the criteria in the comparisons. 

For each test, score equating results for the linear 
equatings — Tucker, Levine equally reliable, Levine un- 
equally reliable — and the nonlinear equatings — chained 
equipercentile and frequency estimation equipercentile — 
were compared by evaluating the within-method differences 
between the fall-spring random or matched groups equat- 
ings and the fall-fall random criterion equatings. In addi- 
tion, for each test, across-method comparisons were made 
by examining the differences of each method and sampling
combination from the fall-fall random groups Tucker crite- 
rion equating (the equating method used to report scores). 
In order to judge the importance of the results for the vari- 
ous models and sample combinations, standard errors were 
computed for the fall-fall random Tucker criterion equating. 
Confidence bands of ±2 standard errors were plotted 
around the Tucker criterion on difference or residual plots
that were used to evaluate selected equating results. The
computer program AUTEST (Lord 1975) was used to com- 
pute the standard errors. 
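
The band check described above can be sketched in a few lines. This is an illustration only: the function name is hypothetical, the standard errors are taken as given (the study computed them with AUTEST), and any numbers supplied to it are placeholders rather than study results.

```python
def outside_band(experimental, criterion, std_errors, width=2.0):
    """Return the raw scores at which an experimental conversion differs
    from the criterion conversion by more than width standard errors.

    All three arguments are lists indexed by raw score (0, 1, 2, ...).
    """
    flagged = []
    for raw, (e, c, se) in enumerate(zip(experimental, criterion, std_errors)):
        if abs(e - c) > width * se:
            flagged.append(raw)
    return flagged
```

Raw scores returned by the function are those where the difference exceeds what sampling fluctuation in the criterion equating alone would be expected to produce.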



RESULTS AND DISCUSSION
Item Analyses 

The results of the conventional item analyses for the five 
tests analyzed in this study are presented in Tables 8 through 
12 and Figures 14 through 18. 

Biology 

An examination of the correlation coefficients in Table 8
obtained for the item difficulty indices (equated deltas)
computed using the different sample combinations taking the
Biology Achievement Test indicates that these coefficients
range from .73 to .99. The highest correlation coefficient
obtained (.99) was for delta values computed using the
samples providing the criterion equating results (i.e., the
random samples drawn from the fall new-form and fall old-form
populations). The lowest correlation coefficient (.73)
was obtained for deltas computed using the random samples
selected from the fall new-form/spring old-form populations
prior to any attempt to match the two groups on ability
level. The correlation coefficients observed for deltas
computed using samples obtained by the two matching procedures
were quite similar (.79 and .77 for the groups
matched using common items and responses from the SDQ,
respectively).

Remember that the two matching procedures were not 
equally successful. As mentioned in the methodology sec- 
tion, matching on scores obtained on the common items 
produced an old-form matched sample that was close in 
ability to the new-form sample. On the other hand, match- 
ing the two groups using responses to the selected SDQ 
questions did not provide samples that were very similar in 
ability level. As a matter of fact, matching using SDQ re- 
sponses appears to have accentuated the disparity between 
the ability levels of the two groups. Given that one matching 
procedure was so much more effective than the other, it was 
expected that the relationship between item-difficulty indices
obtained for the better matched groups would be demonstrated
by a correlation coefficient that was a good deal
higher than that obtained for the deltas computed using the 
poorly matched samples (matched using SDQ responses) 
and the unmatched fall new-form and spring old-form 
groups. 

Further insight into the behavior of the item difficulty 
indices can be obtained by examining plots of the equated
delta values for the 36 common items administered to the 
four sample pairings (fall-fall random groups, fall-spring 
random groups, and the fall-spring groups matched on com- 
mon items and SDQ responses). It can be seen, from exam- 
ination of the plots presented in Figure 14, that for all 
sample combinations (with the exception of the fall-fall ran- 
dom groups combination, which provided the criterion 
score equating results), considerable scatter can be observed 
on either side of the solid diagonal line. The correlation 
coefficients shown under the plots, and in Table 8, are
reflective of this degree of scatter.

Again, note that neither of the matching techniques
substantially reduced the scatter near the diagonal line for the
equated delta values computed using the new- and old-form
samples. As mentioned earlier, matching samples on ability
level using the two SDQ questions was not successful, most
likely because of the large number of examinees choosing not to
respond to the questions. It was expected, however, that once
examinees were matched on ability level on the common-item set,
delta values computed using the matched groups would show a
close relationship, similar to that exhibited by the fall
new-form and fall old-form random samples used to provide the
criterion equatings. The fact that scatter remains in this delta
plot, despite the matching of the new- and old-form samples, is
troublesome. A likely explanation is that the data are
multidimensional, with the common-item portion of the tests
measuring different abilities for the spring and fall groups,
and matching on common-item scores does not remove this
multidimensionality from the data. In addition, matching on a
fallible measure, such as scores on a set of common items that
contain some degree of measurement error, may not provide a
match on true ability.

Table 8 also contains estimated total-test equated delta
means and standard deviations for the new form of the test,
based on the delta equatings obtained using the common items
administered to the various samples. It is interesting to note
that neither the means nor the standard deviations of equated
deltas appear to be affected by the groups used to perform the
delta equatings. The equated delta means can be interpreted as
indicating that the new form is estimated to be of similar
difficulty level for all sample combinations, despite the
differences in ability level that exist for two of the
combinations.

Mathematics Level II 

Table 9 and Figure 15 contain information summarizing the 
results of the item analyses of the Mathematics Level II
Test. Correlations among the 19 common items appearing 
in the new and old form of the test are quite high for all 
sample combinations (.98-.99), and the correlation does not 
appear to be substantially affected by the differences in abil- 
ity levels of the spring old-form and fall new-form groups. 

The plots of equated delta values provided in Figure 15 
reflect the high correlation coefficients for the common 
items given to the various sampling combinations. It can be 
seen, from examination of the plots shown in Figure 15, 
that the item-difficulty indices exhibit very little scatter 
about the diagonal line drawn on these plots. 

It would appear that, although examinees differ in level 
of ability for the fall new-form and spring old-form random 
groups sample combination, the rank order of the difficulty
indices for the common items remains similar across all 
sample combinations. The total-test equated delta means 
and standard deviations (shown in Table 9) remain consist- 
ent across the various sampling combinations. 

American History and Social Studies 

The information presented in Table 10 and Figure 16 sum- 
marizes the item analysis results for the American History 
and Social Studies Achievement Test. Correlation coefficients
for the 20 common items given to the fall new-form,
fall old-form, and fall new-form/spring old-form random 
and matched groups vary between .93 and .94. 

Plots of equated delta values shown in Figure 16 are 
reflective of the correlation coefficients provided in Table 
10, showing some degree of scatter about the diagonal line. 
It would appear that, to some extent, the American History 
and Social Studies Test provides common-item data that are 
somewhat intermediate in terms of scatter about the diagonal
line to that collected for the Biology Test and the Mathematics
Level II Test. The common-item portion of the tests
may possibly be measuring slightly different constructs for 
the new- and old-form groups (as exhibited by the scatter of 
the common items). This observation seems to hold equally 
for the fall-fall random groups combination as well as for 
the two fall-spring combinations. 

It should be noted that the total-test equated delta
means and standard deviations shown in Table 10 are quite
consistent across the various sampling combinations. 

Chemistry 

Item analysis results for the Chemistry Achievement Test
are shown in Table 11 and Figure 17. Examination of the
correlation coefficients for the 20 common items appearing 
in the new and old Chemistry Test forms indicates some ef- 
fect of the differences in the ability level of the fall and 
spring groups on the relationship between the common 
items contained in the two forms. However, total-test 
equated delta means and standard deviations remain quite 
consistent across the different sampling combinations. 

Plots of the equated delta values for the Chemistry Test 
common items are shown in Figure 17. Examination of
these plots indicates slightly more scatter of equated delta 
values about the diagonal line for the items taken by the fall- 
spring combination than for those taken by the fall-fall com- 
bination. Matching the new- and old-form groups on ability, 
on the basis of the common-item scores, has little or no ef- 
fect on the scatter of the equated delta values. 

French 

The final set of item analysis results, those obtained for the
French Achievement Test, are provided in Table 12 and Fig- 
ure 18. The information provided in Table 12 indicates that 
the correlation coefficients between the 21 common items 
contained in the new and old forms of the test, as well as the 
total-test equated delta summary statistics, are remarkably 
similar across four of the sampling combinations. The fall- 
spring matched on BQ combination, however, provided a 
somewhat different set of total-test equated delta summary
statistics. 

The plots of equated delta values shown in Figure 18 
illustrate graphically the close agreement of the equated 
delta values for the various sampling combinations, in spite 
of the considerable differences in ability levels of the fall 
and spring groups. The relationship between the item- 
difficulty values seems not to be particularly affected by dif- 
ferences in group ability. 

Comparison of Item Analysis Results Across Tests 

To summarize, item analysis results (correlations between 
equated delta values) appear to be most affected by the dif- 
ferences in ability exhibited in the fall-spring sampling 
combinations for the Biology Achievement Test and least 
affected for the French and Mathematics Level II Tests. Item 
analysis results for the American History and Social Studies
Test indicate only a slight decrease in equated delta
correlations for the fall-spring random groups results compared
with results obtained for the fall-fall random groups combi- 
nation. The results of the item analyses carried out for the 
Chemistry Test indicate the next largest drop in correlation 
between equated delta values obtained for the fall-spring 
random groups as compared with the results obtained by the 
fall-fall random groups. The largest drop for any test stud- 
ied, as mentioned previously, was observed for the Biology 
Test. These results make sense intuitively, in that science 
tests are more closely tied to the content of a single course
than are tests in mathematics or a foreign language, which 
measure skills obtained from several courses on the topic. 
Therefore, one might expect, because of recency of instruc- 
tion and, possibly, some degree of forgetting, that differ- 
ences would occur in the behavior of common items con- 
tained in science-test forms when given to students just 
completing a course in the subject matter (a spring group) 
and when given to students who may not have studied the
subject matter for some time (a fall group). 

Equatings 

Scaled-score summary statistics resulting from the experi- 
mental equatings performed for the five tests are summa- 
rized in Tables 13-17. Insight into the performance of the 
various equating methods when applied to the different 
sample combinations for the five tests can be obtained 
through a review of the equating plots provided for these 
tests in Figures 19-33. 

Biology 

Examination of the information presented in Table 13
indicates that the scaled-score summary statistics obtained for
all of the Biology Test equatings performed using the random
samples selected from the fall new-form and fall old-form
combination are very similar. Remember that the criterion
equating for the study was specified as being the equating
actually used to report scores for the fall new-form group.
Scores were reported for Biology Test takers using the fall-fall
random group Tucker results. However, it is clear from
examination of the information provided in Table
13 that all procedures used with this sampling combination 
would have provided similar total-group scaled-score means 
and standard deviations. 

It is important to note the differences in equating re- 
sults between those obtained from the fall-fall random 
groups and those obtained from the groups that were ran- 
domly selected from the fall new-form and spring old-form 
populations. It is clear from the summary statistics provided 
in Table 13 that the differences in ability levels of the new- 
and old-form groups taking the Biology Test (see Table 3) 
affect the equating results obtained by the five equating
procedures differentially. An interesting point to note, when
examining these results, is how little the Tucker and frequency
estimation equipercentile procedures are affected by the
differences in ability levels of the new- and old-form samples.
Comparison of the scaled-score summary statistics obtained 
for these experimental conditions with the results obtained 
for the Tucker procedure applied to the fall-fall random 
groups sampling combination shows very little difference in
these results. On the other hand, the other three equating 
procedures investigated (Levine equally reliable, Levine 
unequally reliable, and chained equipercentile) appear to be 
quite affected by the ability differences represented by the 
fall new-form and spring old-form random groups combi- 
nation. The equating results that appear to be the most af- 
fected are those based on the Levine equally and unequally 
reliable methods. A point to note is the inconsistency of the 
results of the procedures for this particular new-form and 
old-form pairing. It would be difficult to identify the most
appropriate set of results to use for score reporting in the 
absence of some criterion. 

Examination of the information presented in Table 13 
for the equatings based on the fall new-form and spring old- 
form groups matched using scores obtained on the common 
items indicates remarkable consistency among the summary 
statistics obtained using all five equating procedures. One 
important point to note is that, although matching provides 
a definite improvement (in relation to the criterion scaled- 
score mean and standard deviation) over the fall-spring ran- 
dom groups equating results for the two Levine procedures 
and chained equipercentile procedure, this improvement is 
not noted for the Tucker and frequency estimation equating
procedures. The summary statistics produced by the Tucker
and frequency estimation equatings based on the fall-spring
random groups are closer to those produced by the criterion
equating than are the summary statistics resulting from the
Tucker and frequency estimation equatings based on the
fall-spring matched (common items) groups. 

As expected, the equating results based on the Biology 
fall new-form and spring old-form groups matched on the
responses to the two SDQ questions produced summary statistics
that were discrepant from the criterion equating summary
statistics. Several points should be noted when examining these
data. First, a good deal of inconsistency among
the scaled-score summary statistics produced by the various 
equating procedures exists, with the results produced by the 
two Levine procedures and the chained equipercentile pro- 
cedure providing lower scaled-score means than either the 
Tucker or frequency estimation equipercentile results.
Using the Tucker results from the fall-fall random groups as
a criterion, it is apparent that the two Levine procedures
used with this sampling combination produce the results
that are most discrepant from the criterion. The Tucker and
frequency estimation equipercentile procedures produce re- 
sults that most closely approximate the criterion results. 

Plots presented in Figure 19 allow the comparison of 
the Biology Test equating results obtained by a particular 
equating procedure (e.g., Tucker) for the four Biology Test
sampling combinations. Panel (a) contains results for the
Tucker procedure; panel (b) contains results for the Levine
equally reliable procedure; panel (c) contains results for the
Levine unequally reliable procedure; panel (d) contains re- 
sults for the chained equipercentile procedure; and panel (e) 
contains results for the frequency estimation equipercentile 
procedure. Each panel contains plots of differences in 
scaled scores, with the fall-fall random groups equating re- 
sults subtracted from the fall-spring random or matched 
groups equating results. In each case, the fall-fall results 
used for comparison are those obtained by the particular 
equating procedure that is being evaluated. 

Examination of the results obtained in panel (a) of Figure 19
shows the close agreement among the Tucker results under all
sampling conditions that was illustrated by the Tucker
scaled-score summary statistics for the Biology Test found in
Table 13. Only the residuals obtained from the comparison for
the fall-spring groups matched on SDQ responses are greater than
five scaled-score points, and this is found only for raw scores
in the lower end of the distribution.

The results presented in panels (b) and (c) for the 
Levine equally and unequally reliable methods are similar 
to each other and quite different from those shown in panel 
(a). It is clear that for both Levine methods the only results 
that closely approximate the results obtained under the fall- 
fall random condition are the results obtained for the fall- 
spring groups matched on common items. For both Levine 
procedures, the fall-spring random groups and the fall- 
spring groups matched on SDQ responses produce equating 
results that are quite discrepant from the fall-fall random 
groups equating. 

The results presented in panel (d) for the chained equi- 
percentile procedure are difficult to interpret because of the 
irregularities in the raw to scale conversions. However, it
can be seen that the results that most closely match those 
obtained for the fall-fall random groups combination are 
those for the fall-spring groups matched on common items. 

The frequency estimation equipercentile equating results are
summarized in panel (e) of Figure 19. An examination of these
results indicates a degree of similarity to the results obtained
by the Tucker linear procedure in that the closest results to
the criterion equating (frequency estimation equipercentile for
the fall-fall random groups in this case) are provided by the
fall-spring random groups equating. The most discrepant results
are produced by the fall-spring groups matched using
information from the SDQ.
Note that all results depart considerably from the criterion 
results for scaled scores corresponding to raw scores below 
20; this is most likely partly due to instability in the equat- 
ings related to the small number of frequencies in this por- 
tion of the score distributions. 

The Biology Test residual plots presented in Figure 20
have been developed to permit an examination of the agreement
among the five equating procedures applied to a particular
sampling combination. The plots shown in panel (a) of Figure 20
compare results for the Tucker, two Levine, chained
equipercentile, and frequency estimation equipercentile equating
methods used with the fall-fall random groups samples. Panel (b)
of Figure 20 summarizes the results of the fall-spring random
groups equatings. The fall-spring groups (matched on common
items) equatings are summarized in panel (c), and the equatings
performed using the fall-spring groups (matched on SDQ
responses) are summarized in panel (d). Note that the same
criterion equating is used in each plot, that is, the results of
the Biology Test Tucker fall-fall random groups equating. In
addition, a
confidence band of ±2 standard errors has been plotted 
around the criterion equating in each plot. If the differences 
between the criterion and experimental results fall within 
this band, it can be concluded that these equating differ- 
ences are no larger than what would be expected from sam- 
pling fluctuations (with respect to the criterion) alone. 

Examination of the Biology Test equating-difference 
plots shown in panel (a) of Figure 20 indicates that all 
equating methods, applied to data obtained from the fall-fall 
random groups combination, produce results that are in 
fairly close agreement. Results of the two Levine procedures
and frequency estimation equipercentile equating procedure
agree very closely with those obtained using the Tucker model.
The chained equipercentile results are generally close to the
criterion results and do not show a tendency to consistently
over- or underestimate equated scores.

Panel (b) contains the results of applying the five 
equating procedures to data obtained from the Biology Test 
fall new-form and spring old-form random groups sampling 
combination. Examination of the plots indicates that both 
the Tucker and frequency estimation equipercentile procedures
provide results that agree in a reasonable fashion with the
criterion results. The difference line for the Tucker equating
falls within the confidence band plotted around the criterion
equating results for the full range of raw scores. This is true
also for the frequency estimation equipercentile results, with
the exception that discrepancies for this procedure fall out of
the confidence band for low raw scores where there is little
data. The Levine equally and unequally reliable and chained
equipercentile procedures produce results that are quite
discrepant from the criterion equating results. Both the Levine
and the chained equipercentile procedures underestimate the
criterion scores for most of the raw score distribution. The
equipercentile results do, however, appear to more closely
approximate the criterion results than do the results of the
Levine procedures.

The results shown in panel (c) of Figure 20 summarize 
the equating differences obtained from a comparison of the 
results of the five equating methods used with the fall
new-form and spring old-form groups (matched using scores on
the common items) with results obtained for the Tucker
model applied to the fall-fall random groups combination. 
It can be seen that most of the equating differences resulting 
from application of the five procedures fall within the con- 
fidence band plotted around the criterion equating results. 

Panel (d) contains the residuals resulting from a comparison
of the criterion equating results with the equating
results obtained from application of the five equating
procedures to data from the fall new-form and spring old-form
groups matched using responses to the SDQ questions. The
pattern of residuals observed in panel (d) is similar to that 
previously examined for panel (b), with the exception that 
the equating differences are even more pronounced for the 
groups matched using SDQ responses. It is clear, from the 
information presented in panel (d), that the Tucker and fre- 
quency estimation methods provide results that agree some- 
what with those obtained by the Tucker model applied to the 
fall-fall random groups combination. The equating proce- 
dures that provide unacceptably large residuals are the ones 
involving application of the Levine equally and unequally 
reliable and chained equipercentile models.

Figure 21 summarizes the results of the application of 
the five equating procedures to the four Biology Test sampling
combinations. New-form scaled-score means are plotted for each
equating-method and sampling combination. A
number of points that have already been mentioned become 
even more apparent from an examination of the plots shown 
in Figure 21. First, it is clear that the methods that provide
reasonably consistent results across the four new-form and 
old-form sampling combinations are the Tucker and fre- 
quency estimation equipercentile methods. The next most 
stable results are provided by the chained equipercentile 
procedure. The least stable results are provided by the Levine
equally and unequally reliable procedures. These procedures
appear to be severely affected by differences in
group ability, as exhibited in the results obtained for the fall- 
spring random and fall-spring matched (SDQ) sampling 
combinations. Additional points worth noting are: 

1. There is close agreement of the results of all the
methods for the fall-fall random groups combination;

2. Matching on common items promotes agreement 
among the five methods, with all five producing 
results that slightly underestimate the criterion 
scaled-score mean; and 

3. Better results (i.e., closer estimates of the criterion
scaled-score mean) are obtained for the Tucker and 
frequency estimation equipercentile methods by 
using the fall-spring random groups combination 
and not matching. 

For Levine equally and unequally reliable and the chained 
equipercentile procedures, matching on common items 
clearly provides better results. 

The results of the application of the Tucker and fre- 
quency estimation equipercentile models to the various 
sampling combinations are interesting from several points 
of view. The most important point is that the scaled-score 
summary statistics produced by these methods applied to 
the fall new-form and spring old-form random groups
combination very closely approximate the criterion summary
statistics. This implies that for this test, the Tucker and
frequency estimation equipercentile methods are relatively
unaffected by new-form and old-form sampling combinations
that differ in ability level. This is certainly not true for the
other three equating procedures applied to the Biology Test
data. These procedures clearly appeared to be affected by
the group ability differences displayed by the fall-spring
random groups and fall-spring matched groups (SDQ) sam- 
pling combinations. 

Mathematics Level II 

The results of the experimental equatings carried out using 
data from the Mathematics Level II Test are summarized in 
Table 14 and Figures 22-24. Scaled-score summary statis- 
tics resulting from application of the five equating methods 
to this test appear in Table 14. Examination of the information
presented in this table indicates that the scaled-score
summary statistics obtained for all of the Mathematics
Level II equatings performed using the random samples selected
from the fall new-form and fall old-form combination
are quite similar. Remember that the criterion equating for
the study was specified as the fall-fall random groups 
Tucker equating actually used to report scores for the fall 
new-form group. However, it is clear that all procedures 
used with this sampling combination would have resulted in 
similar reported scores. 

Differences between the results of applying the five 
equating procedures to the fall-fall sampling combination 
and to the fall-spring random groups combination are quite 
apparent from the scaled-score summary statistics presented
in Table 14. The procedures that appear to be least affected 
by differences in group ability are the two Levine procedures,
and the procedures most affected by group differences are the
Tucker and frequency estimation equipercentile procedures.

The information presented in Table 14 for equatings
based on fall-spring samples matched on their respective
distributions of common-item scores is quite interesting.
Matching provides similar results for all five equating pro- 
cedures and results that provide an overestimate of the cri- 
terion scaled-score mean and standard deviation. 

Further insight into the equating summary statistics 
presented in Table 14 for the Mathematics Level II Test can 
be obtained by examination of the equating-difference plots 
provided in Figures 22 and 23. Plots shown in Figure 22
provide comparisons of the results of a single procedure ap- 
plied to the three sampling combinations. The plots shown 
in Figure 23 permit comparison of different equating proce- 
dures applied to a particular sampling combination. 

Results contained in panel (a) of Figure 22 show the
discrepancy among the Tucker results, with both the fall-spring
random and fall-spring matched common-item results
overestimating the criterion scaled scores produced by the
Tucker fall-fall random groups equating. It is interesting to
note that matching had little effect on the fall-spring results.
The matched group equatings produced slightly lower scaled
scores than the fall-spring random groups equating for the upper
end of the distribution of raw scores and slightly higher scaled
scores for the lower end of the distribution.

Information provided in panels (b) and (c) for the Levine
equally and unequally reliable procedures, respectively,
indicates that for both procedures the matched and
random groups fall-spring combinations provide overesti- 
mates of the Levine equating results based on the fall-fall 
random groups combinations. For both procedures, match- 
ing, based on distributions of scores on the common-item 
set, provided results that overestimated scaled scores ob- 
tained for the fall-fall combination more than the results 
provided by the unmatched fall-spring combination. 

Examination of the information presented in panel (d) 
indicates that the results obtained using chained equipercen- 
tile equating agree closely for both the fall-spring random 
and fall-spring matched common-item groups and that re- 
sults of both of these procedures overestimate scaled scores 
obtained through application of the chained equipercentile 
procedure to data obtained for the fall-fall random groups 
combination. Finally, the results provided for the frequency 
estimation equipercentile procedure (panel (e) of Figure 22)
are similar to those obtained for the other four procedures 
(i.e., both matched and unmatched results overestimate the 
equating results obtained from the fall-fall sampling com- 
bination). 

The Mathematics Level II residual plots shown in Figure 23
permit comparisons of the various equating methods
applied to a particular sampling combination with the crite- 
rion equating results (Tucker applied to the fall-fall sam- 
pling combination). The plots shown in panel (a) compare 
results for the Tucker, Levine equally and unequally
reliable, chained equipercentile, and frequency estimation
equipercentile methods used with the fall-fall random groups
samples. Close agreement among the various methods is ex- 
hibited for a raw-score range of approximately 5-40. The 
two equipercentile methods diverge from each other and 
from the linear criterion equating results for extreme raw 
scores. This is most likely due to the instabilities of both 
equipercentile procedures related to the small number of 
frequencies in the tails of the score distributions. 

Examination of the information provided in panel (b) 
of Figure 23 reveals that the only procedures (applied to 
the fall-spring random groups combination) that provide 
scaled-score discrepancies within ± 2 standard errors of the 
criterion equating are the Levine equally and unequally re- 
liable procedures. All other procedures appear to provide 
results that overestimate the criterion scaled scores through 
the middle portion of the score range. The scaled-score
discrepancies plotted in panel (c) are quite similar to those
shown in panel (b), indicating that matching using distribu- 
tions of scores on the common items had little effect on the 
equating results. Exceptions are the results of the Levine 
equally and unequally reliable procedures. These procedures
seem to have been made worse by the matching. For
the Levine procedures, the matched results produce scores 
that are a considerable overestimate of the Mathematics 
Level II criterion scores. 
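
The ± 2 standard error comparison used in the panels of Figure 23 can be sketched as follows. This is a hedged illustration rather than the computation used in the study: it assumes that each equating is available as a list of scaled scores indexed by raw score, and that a standard error of the criterion equating is available at each raw score. The function names and all numbers are hypothetical.

```python
def equating_differences(experimental, criterion):
    """Scaled-score difference (experimental minus criterion) at each raw score.

    Both arguments are lists of scaled scores indexed by raw score.
    A positive difference means the experimental equating overestimates
    the criterion scaled score at that raw score.
    """
    if len(experimental) != len(criterion):
        raise ValueError("conversions must cover the same raw-score range")
    return [e - c for e, c in zip(experimental, criterion)]


def raw_scores_outside_band(differences, standard_errors, width=2.0):
    """Raw scores whose difference lies outside +/- width standard errors."""
    if len(differences) != len(standard_errors):
        raise ValueError("one standard error is needed per raw score")
    return [raw for raw, (diff, se) in enumerate(zip(differences, standard_errors))
            if abs(diff) > width * se]


if __name__ == "__main__":
    # Hypothetical conversions for raw scores 0..3 (not data from the study).
    criterion = [200.0, 250.0, 300.0, 350.0]
    experimental = [200.0, 251.0, 305.0, 362.0]
    standard_errors = [4.0, 3.0, 2.0, 3.0]
    diffs = equating_differences(experimental, criterion)
    print(diffs)
    print(raw_scores_outside_band(diffs, standard_errors))
```

An equating whose differences stay inside the band across the score range would be judged consistent with the criterion in the sense used above; one that leaves the band through the middle of the range would not.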

Figure 24 summarizes the results of the application of 
the five equating procedures to the three Mathematics Level 
II sampling combinations. New-form scaled-score means
are plotted for each equating-method and equating-sam- 
pling combination. From examination of Figure 24, it is ap- 
parent that all of the equating procedures are affected to 
some extent by the difference in group ability exhibited by 
the fall-spring random groups combination, with the Tucker
and frequency estimation procedures affected a good deal
more than the Levine procedures. It is also apparent that
matching on the basis of common items, although it
provides greater similarity among the scaled-score means
obtained by the five methods, provides only slightly improved
results for the Tucker and frequency estimation procedures
and actually worsens the results obtained by the Levine and
chained equipercentile procedures.

American History and Social Studies 

Table 15 and Figures 25-27 contain the results of the equat- 
ings carried out for the American History and Social Studies 
Achievement Test. Table 15 contains scaled score summary 
statistics resulting from the application of the five equating 
methods to the three sampling combinations (fall-fall ran- 
dom, fall-spring random, and fall-spring matched using 
common items). Examination of the information provided 
in Table 15 indicates close agreement among the results of 
the five equating methods applied to the fall-fall random 
groups sampling combination. The criterion is the result of 
the Tucker fall-fall random groups equating used to report
scores for the fall new-form group. However, as was the 
case for the Biology and Mathematics Level II Tests, it is 
clear that all equating procedures used for the fall-fall sam- 
pling combination would have resulted in similar scaled- 
score summary statistics. 

The results of applying the five equating methods to
the fall-spring random groups sampling combination
demonstrate that the equating procedures most clearly affected
by the differences in group ability are the Levine
procedures. The Tucker and frequency estimation procedures
behave similarly for the fall-spring random groups
combination, each producing scaled scores that overestimate the
criterion scaled scores. The equating procedure used with
this test that appears to be the least affected by differences
in group ability is the chained equipercentile procedure.

The scaled-score summary statistics resulting from the 
samples constructed by matching (based on the distribu- 
tions of scores on the common items) presented in Table 15 
show that matching provides greater agreement among the 
five equating methods applied to the American History 
and Social Studies Test. It should be noted that, in all
cases, matching provides scaled-score summary statistics that
appear to overestimate the criterion scaled-score summary
statistics.

Figures 25 and 26 provide equating-difference plots for
the American History and Social Studies Test. Examination
of the results contained in panel (a) of Figure 25 indicates
the close agreement between the Tucker results obtained
from the fall-spring random groups and the fall-spring groups
matched on common items. It is clear that the Tucker results,
applied to both sample combinations, overestimate criterion 
scaled scores in the upper end of the raw-score distribution 
and underestimate criterion scaled scores for the lower end 
of the distribution. 

The information presented in panels (b) and (c) is
indicative of the serious effect the differences in ability levels
of the fall and spring groups have on the Levine equally and 
unequally reliable equating procedures applied to this test. 
(See Table 5 for a comparison of these differences.) Both 
the Levine procedures used with the fall-spring random 
groups sample combination provide an underestimate of the 
scaled scores obtained using data from the fall-fall random 
groups combination. These equating procedures applied to
the fall-spring matched through common-items groups pro- 
vide an overestimate of scaled scores in the upper end of the 
raw-score distribution and an underestimate of scaled scores 
in the lower end of the distribution when compared with the 
fall-fall random groups results. 

Information provided in panel (d) of Figure 25 indi- 
cates the effect on the results of the chained equipercentile 
procedure of matching on the basis of the distribution of 
scores on the common items. It is quite clear that matching 
had a deleterious effect on this equipercentile procedure 
used with this particular test. Panel (e) contains the results 
of the frequency estimation equipercentile procedure. The 
plots shown in this panel indicate that the frequency
estimation results are affected by the differences in group ability
displayed by the fall-spring random groups and that
matching on the basis of common-item score distributions
seems to slightly exacerbate this effect.

Figure 26 contains plots of equating results for the 
American History and Social Studies Test that allow com- 
parisons of the five equating methods applied to a particular 
sampling combination. The results of the equatings shown 
in panel (a) of Figure 26 indicate that both curvilinear pro- 
cedures, particularly the chained equipercentile procedure, 
provide results that fall slightly outside the ± 2 standard er- 
ror bands for scores in the middle part of the distribution 
and again for scores in the extremes of the distribution. 

Panel (b) contains the results of the five equating meth- 
ods applied to the fall-spring random groups sampling com- 
bination. Examination of the plots contained in panel (b) 
shows the serious effect that differences in group ability, dis- 
played by the fall-spring random groups combination, have 
on the Tucker and the two Levine equating procedures.
Scaled scores resulting from the two curvilinear procedures 
fall mostly within the standard error bands provided for the 
criterion-equating results for raw scores ranging from about
10 to 60. Frequency estimation equating results show a
tendency, as do results of the other methods, to seriously
overestimate scaled scores obtained for raw scores in the upper
end of the distribution.

Information provided in panel (c) of Figure 26 sum- 
marizes the results of the five equating procedures applied 
to the fall-spring groups matched using distributions of 
scores on the common items. The plots shown in panel (c) 
indicate that matching, using the frequency distribution of 
scores on the common-item set, brings the results of all four 
equating procedures closer together; in particular, the linear 
procedures now show fairly close agreement. It appears as 
though all the procedures provide scores that are a
considerable overestimate of the criterion scores for the upper end
of the raw-score distribution. 

Figure 27 contains plots of American History and So- 
cial Studies Test new-form scaled-score means obtained for 
each equating-method and equating-sampling combination. 
It is clear from an examination of the scaled-score means 
that all equating methods, with the exception of the chained 
equipercentile method, are affected by the differences in 
group ability presented by the fall-spring random group 
sampling combination. The differences in sample ability 
level affect the Tucker and frequency estimation procedures
in the opposite direction from the effect the differences have 
on the Levine equally and unequally reliable procedures. 
An additional point that is apparent from the plots is that 
matching, using scores on the common-item set, has a del- 
eterious effect on all methods except the Levine equally and 
unequally reliable methods in that the resulting means are 
more discrepant from the criterion mean than the means re- 
sulting from no matching. Matching does provide results 
among the methods that are more in agreement. 

Chemistry 

The experimental equatings performed for the Chemistry 
Achievement Test are summarized in Table 16 and Figures 
28 through 30. Table 16 contains scaled-score summary
statistics resulting from application of the five equating
methods to the three new-form and old-form sampling
combinations (fall-fall random, fall-spring random, and fall-spring
matched on common items).

The information shown in Table 16 indicates that the 
five equating methods provided results in close agreement 
for the fall-fall random groups sampling combination. Re- 
sults for the equatings using the fall-spring random groups 
combination differed considerably from the criterion scaled- 
score means (the fall-fall Tucker results used to report
scores for the fall new-form group). As was the case for the
results of the equatings for the tests previously discussed,
Tucker and frequency estimation equipercentile equating
methods provide similar results for this sampling combina- 
tion, and the Levine equally and unequally reliable and 
chained equipercentile procedures provide results that differ 
somewhat from each other and from the other two equating 
procedures. Examination of the results obtained for the five
equating methods applied to the Chemistry Test fall-spring
groups matched using the common-item set shows close 
agreement among all five equating procedures; however, the 
discrepancies between the matched-group means and the 
criterion mean are quite pronounced, more so than for any 
test discussed so far.

Figure 28 contains plots of equating differences for the 
Chemistry Test that parallel those described for the three 
tests that have been previously discussed. The plots pre- 
sented in Figure 28 have been developed to permit compar- 
isons of the manner in which a particular equating method 
performs across the three sampling combinations. 

Examination of the information presented in panel (a) 
of Figure 28 shows that the Tucker method applied to the
Chemistry Test data is affected by the differences in group 
ability exhibited by the fall-spring random groups combi- 
nation. Furthermore, matching the new- and old-form 
groups using distributions of scores on the common-item 
test has little effect on the Tucker results. Tucker results for 
both fall-spring combinations were quite similar. 

The information provided in panels (b) and (c)
indicates that the Levine equally and unequally reliable
results are also affected by differences in group ability and
provide a considerable overestimate of scaled scores (when
compared with the results of the fall-fall sampling
combination) in the upper end of the score distribution. Matching on
the basis of distributions of scores on the common items
provides peculiar results in that the matched-group
equating provides scores similar to those obtained from the
unmatched fall-spring group for the upper end of the score
distribution and scores that are more of an overestimate
for the lower end of the score distribution. In general,
matching seems to have a deleterious effect on both Levine
procedures.

Information provided in panels (d) and (e) (for the
chained equipercentile and the frequency estimation
equipercentile procedures, respectively) is quite similar. In
general, both procedures appear to be affected by the
differences in group ability displayed by the fall-spring random
groups combination. And, as was the case for the Tucker
and Levine procedures used with this test, matching on the
basis of common items provides somewhat worse results for
both procedures but does not have as serious an effect on the
results as that observed for the Levine procedures.

Figure 29 contains equating-difference plots that have
been developed to provide comparisons of the different
equating procedures applied to a particular sampling
combination.

Examination of the information provided in panel (a)
of Figure 29 shows that, for the fall-fall random groups
combination, the equating results provided by the two
Levine procedures do not differ substantially from the Tucker
criterion results. The two curvilinear procedures, chained
equipercentile and frequency estimation equipercentile,
also appear to agree fairly well with the criterion results
through the midportion of the score range. The plots shown
in panel (b) of Figure 29 illustrate dramatically the effect
that the differences in group ability have on the five equating 
methods used with the Chemistry Test. For the fall-spring 
random groups combination in panel (b), none of the pro- 
cedures provide results that are within two standard errors 
of the criterion scaled scores. However, the effect of match- 
ing on the common-item scores also supplies extreme re- 
sults, as illustrated by the plots shown in panel (c), with all 
methods providing similar overestimates of the criterion
scaled scores. 

Further insight into the Chemistry Test equating results 
can be obtained by an examination of the scaled-score 
means for the various equating procedures and sampling 
combinations that are provided in Figure 30. 

Examination of the information provided in Figure 30 
shows the serious effect on all five equating procedures of 
the ability differences exhibited by the fall-spring random 
groups combination. In this case, the Tucker and frequency
estimation procedures appear to be the most affected, 
whereas the Levine equally and unequally reliable proce- 
dures are the least affected. Matching on the basis of the 
common-item set provides scaled-score means for the two 
Levine and chained equipercentile procedures that are con- 
siderably more discrepant from the criterion scaled-score 
mean than those provided by the unmatched fall-spring ran- 
dom groups equating. Matching also appears to provide 
slightly more discrepant means (as compared with the
unmatched fall-spring means) for the Tucker and frequency
estimation equipercentile procedures.

French 

Scaled-score summary statistics resulting from the experi- 
mental equatings for the French Test are provided in Table 
17. Examination of the information provided in Table 17
indicates that all the French Test equatings performed using 
the random samples selected from the fall new-form and 
old-form combination are quite similar. Again, the criterion 
equating for the French Test is the equating actually used to 
report scores for the fall new-form groups (i.e., the fall-fall 
random groups Tucker results). It is important to note the
differences obtained by applying the five equating proce- 
dures to the groups randomly selected from the fall new- 
form and spring old-form populations. It is clear, from the 
summary statistics provided in Table 17, that the differences 
in ability levels of the new- and old-form groups taking the 
French Test have a substantial effect on the equating results 
for all procedures, with the Levine procedures affected the 
least. 

Examination of the information presented in Table 17 
for the equatings based on the French Test fall new-form 
and spring old-form groups matched using scores obtained 
on the common items indicates, as was the case for all tests 
discussed thus far, remarkable consistency among the 
scaled-score summary statistics obtained by application of 
the five equating procedures. An important point to note is 
that, although matching promotes consistency among the
results of the five equating methods, all five sets of summary
statistics provide overestimates of the criterion scaled-
score mean and standard deviation. 

As expected, the equating results based on the French 
Test fall new-form and spring old-form groups, matched 
using responses to the two SDQ questions or on the single 
BQ question, produced scaled-score summary statistics that 
were discrepant from the criterion equating summary statis- 
tics. Matching on the basis of SDQ responses appears to 
have provided summary statistics, particularly means, that 
are very similar to those obtained for the unmatched fall- 
spring random groups combination. Matching on the basis 
of BQ responses appears to have exacerbated the situation
by providing statistics even more discrepant from the crite- 
rion equating results than those obtained using the fall- 
spring random groups equatings. 

Further insight into the equating results presented in 
Table 17 can be obtained by examining the
equating-difference plots shown in Figure 31. As with the other
tests discussed thus far, the plots shown in Figure 31 allow
comparison of the results of the application of the particular 
equating procedure across the five sampling combinations. 
Remember that, in each case, the fall-fall results used for 
comparison are those obtained by the particular equating 
procedure that is being evaluated. 

Examination of the results presented in panel (a) of
Figure 31 shows the close agreement among the Tucker re- 
sults for all the fall-spring sampling combinations (with the 
exception of the results based on the BQ matching) and how 
these results overestimate the criterion scaled scores 
through most of the raw-score range, but particularly in the 
upper end of the raw-score distribution. The results pre- 
sented in panels (b) and (c) are somewhat similar to the 
Tucker results presented in panel (a). In general, the fall-
spring random and matched Levine results (equally and un- 
equally reliable) have a tendency to overestimate scores 
(when compared with the results of the fall-fall Levine 
equatings), particularly in the upper end of the score
distribution. It is important to note how matching on the basis of
scores obtained on the common-item set appears to worsen 
the situation. 

The results presented in panel (d) for the chained equi- 
percentile procedure indicate that all equatings based on the 
fall-spring sampling combinations (both matched and un- 
matched) have a tendency to overestimate scaled scores 
when compared with the fall-fall random groups results. 
This trend is also apparent for the frequency estimation
equipercentile results summarized in panel (e). In both
cases, the equating results based on matching using scores
on the common items appear to provide scaled scores that
are more of an overestimate of the fall-fall results than those
obtained using the unmatched fall-spring random groups 
samples. 

The French Test residual plots shown in Figure 32 have 
been developed to permit the examination of the agreement 
among the five equating procedures applied to a particular
sampling combination. The plots shown in panel (a)
compare results for the five equating procedures applied to the
fall-fall random groups samples. Those shown in panel (b) 
are for the equating procedures applied to the fall-spring 
random groups combination. Panel (c) contains plots perti- 
nent to the evaluation of the results of equatings based on 
groups matched using common-item scores; panel (d) con- 
tains equating results for groups matched on the basis of the 
two SDQ questions and panel (e), equating results for 
groups matched on the basis of their responses to the single 
BQ question. 

Examination of the difference plots shown in panel (a) 
of Figure 32 indicates fairly close agreement of the results
obtained for the five procedures with the criterion scaled 
scores; however, it should be noted that both equipercentile 
procedures have a tendency to underestimate criterion 
scaled scores in the middle of the distribution and overesti- 
mate these scores in the ends of the raw-score distribution. 
Panel (b) of Figure 32 contains equating discrepancies for 
the five methods applied to the fall-spring random groups 
combination. It is clear, from examination of the plots 
shown in panel (b), that all procedures had a tendency to 
overestimate the criterion scaled scores in the upper end of 
the raw-score distribution. 

Panels (c), (d), and (e) contain the results of matching 
the fall new-form and spring old-form groups in ability level 
on the basis of scores on a common item set, responses to 
the SDQ questions, or responses to the BQ questions. It is 
clear, from an examination of the results provided in these 
three panels, that matching, using any of the three covar- 
iates, did not provide satisfactory results. Matching on the 
basis of common-item scores, as seen in panel (c), provides 
consistent results across equating methods, but these results 
also provide consistent overestimates of the criterion scaled 
scores. 

Figure 33 summarizes scaled-score means resulting 
from the application of the five equating procedures to the 
five sampling combinations used for the French Test. Ex- 
amination of the information presented in Figure 33 shows 
several trends that have been observed from the other 
Achievement Tests that have been discussed so far. First, 
equatings based on groups that are similar in ability level 
provided consistent results across the five methods exam- 
ined. Second, equatings based on groups differing in ability 
level provide different results depending on the equating 
procedures that are used. In the case of the French Test, the 
Tucker and frequency estimation procedures appear to be 
the most affected by differences in ability levels, whereas 
the Levine procedures appear to be the least affected.

Comparison of Equating Results Across Tests 

It is important to compare the results provided by the five 
equating methods applied to the various sampling combina- 
tions for the five tests used in this study. For this compari- 
son, the results of matching using responses to the SDQ or 
BQ are not discussed. Basically, it was not possible to 
match groups on ability level using these covariates;
therefore, including the equating results obtained for these
samples in a comparison of the results of equatings using
matched versus unmatched samples would not contribute to 
the discussion. 

Figures 21, 24, 27, 30, and 33 provide the most useful
information for comparing equating procedures and sam- 
pling combinations across the five tests. As mentioned pre- 
viously, several trends are worth noting when comparing 
scaled-score means contained in these figures. First, the 
most consistent result across all five tests is that when new-
and old-form samples are similar in level of ability, either 
because they are randomly selected from similar popula- 
tions or because they have been deliberately constructed to 
be the same (i.e., matched on a set of common items), all 
the equating procedures evaluated in this study provide
similar results. Second, in almost all instances, matching the
old-form group to the new-form group using an internal set 
of common items produces scaled-score means that differ 
from those obtained when random samples from fall popu- 
lations are used. 

A third observation is that differences in ability levels 
of new- and old-form samples affect equating procedures
differently across tests. In the case of the Biology Test, the
procedures most affected by group ability differences were 
the Levine equally and unequally reliable procedures. The 
equating procedures least affected by these ability
differences were the Tucker and frequency estimation equating
procedures. Matching on a set of common items, although 
pulling means obtained by the Tucker and frequency
estimation equipercentile methods somewhat away from the
criterion scaled-score mean, generally improved the scaled- 
score means obtained by the Levine and chained equiper- 
centile procedures with respect to the criterion. 

One might expect the Chemistry Test to behave in a 
somewhat similar manner to the Biology Test. Both tests are 
reflective of specific skills developed in one or perhaps two 
high school courses. The tests differ, however, in that chem- 
istry is usually taught later in the high school curriculum 
than biology. Thus, examinees constituting spring and fall 
groups for the Chemistry Test may be closer together than 
Biology Test takers when recency of coursework is com- 
pared. This could cause the two tests to behave differently, 
which proved to be the case. For the Chemistry Test, the 
procedures most affected by differences in group ability 
were the Tucker and frequency estimation equipercentile 
procedures. The procedures least affected were the Levine 
equally and unequally reliable procedures. Matching on the 
basis of scores on a set of common items slightly worsened 
the Tucker and frequency estimation equipercentile results,
when compared with the criterion scaled-score mean, and 
greatly worsened results obtained for the Levine and 
chained equipercentile procedures. 

Scaled-score means for the American History and So- 
cial Studies Test, displayed in Figure 27, indicate that the 
equating procedure least affected by differences in group 
ability is the chained equipercentile procedure. The proce- 
dures most affected are the Levine equally and unequally 
reliable procedures. The Tucker and frequency estimation
equipercentile procedures are affected by differences in
group ability in a similar manner. Matching new- and old- 
form samples on the basis of common items was in this case
quite unsuccessful, increasing the discrepancy between the 
experimental and criterion scaled-score means for all
procedures except the Levine procedures.

Referring to the results displayed for the Mathematics 
Level II Test in Figure 24, one can see that the procedures 
most affected by differences in group ability are the Tucker
and frequency estimation equipercentile procedures. The 
procedures least affected by differences in ability levels of 
the equating samples are the Levine equally and unequally 
reliable procedures. Matching on the basis of scores on 
common items improves the frequency estimation
equipercentile equating and Tucker results slightly, when compared
with the criterion scaled-score mean, but definitely worsens 
both the Levine and chained equipercentile results. 

Finally, the results obtained for the French Test, which 
are displayed in Figure 33, indicate that the procedures most 
affected by differences in group ability are the Tucker and
frequency estimation equipercentile methods. The
procedures least affected by differences in group ability are the
Levine equally and unequally reliable procedures. Match- 
ing on the basis of common items has almost no effect on 
the Tucker and frequency estimation equipercentile means
but, as has been observed previously for the Chemistry and 
Mathematics Level II Tests, matching on common item 
scores worsens the results obtained for the Levine and 
chained equipercentile methods. 

It is fairly clear from the analyses carried out for this 
study that using new- and old-form samples randomly se- 
lected from populations differing in ability level to equate 
Achievement Tests may lead to scores, and particularly 
scaled-score means, on the new forms of the tests that are 
less comparable than desired. In only a few instances, when
the equating samples involved fall-spring random groups,
were the scaled-score means produced for particular new
forms similar to those obtained by the criterion equatings.
These instances were (1) with the use of chained
equipercentile equating for the American History and Social
Studies Test, and (2) with the use of Tucker and frequency
estimation equipercentile procedures with the Biology Test.

From an earlier study (Cook, Eignor, and Schmitt
1988), which examined experimental equatings for the 
Biology Test, it appeared that one solution to the problem 
of ability differences in equating samples was to match new- 
and old-form groups on some covariate such as a set of
common items. Indeed, when this procedure was carried out for
the Biology Test, the experimental equatings applied to the 
matched-group data produced scores consistent with each 
other and also closely approximating the criterion scores. 
Unfortunately, results observed for the matched (common 
items) equatings for the Biology Test did not generalize to 
the other four tests examined in this study. For the remain- 
ing tests studied, matching using common items produced 
close agreement among results, but these results were dis- 
crepant from the criterion results used for comparison. 



Results of the experimental equatings for the Mathe- 
matics Level II Test indicate that if new- and old-form 
samples differ in level of ability, the equating methods that 
will produce scaled-score means closest to those obtained 
for the criterion equating are the Levine equally and un-
equally reliable methods. The results of the matched sample
equatings carried out for the Mathematics Level II Test in- 
dicate that, for the Levine equally reliable, Levine un- 
equally reliable, and chained equipercentile procedures, 
matching produces scaled-score means that are more dis- 
crepant from the criterion mean than not matching. The re- 
sults of the frequency estimation and Tucker procedures 
used with the matched groups indicate that matching pro- 
duces means slightly less discrepant from the criterion mean 
than if matching had not taken place. Generally, for the re- 
maining tests, although one equating method may produce 
slightly better results (better in the sense that the scaled- 
score mean more closely matches the criterion mean), most 
methods are affected by differences in ability level between 
the new- and old-form samples. In all cases, however, ex- 
cept for the Levine equally and unequally reliable proce- 
dures used with the Biology and American History and So- 
cial Studies Tests, matching on a set of internal common 
items provides scaled-score means even more discrepant
from the criterion means. 

As mentioned previously, matching using responses to 
questions from the SDQ or the BQ was singularly unsuc-
cessful. This is undoubtedly due to the low correlations be-
tween the respective covariates and Achievement Test
scores. In addition, problems using SDQ responses to 
match groups in ability level were also probably related to 
the lower response rate to the SDQ questions used for this 
study. 

One question that may be asked is: why did the Tucker 
and frequency estimation equating results for the Biology 
Test appear to be so invariant to differences in group ability? 
These two procedures when applied to the remaining four 
tests seemed to be quite affected by ability differences. 

One of the main differences between the Biology Test 
and the remaining four tests studied is the length of the an- 
chor test, 36 common items as opposed to a maximum num- 
ber of 21 in the other tests. Perhaps because the additional 
number of items in the Biology Test common-item set pro- 
vided a more reliable anchor-test score, scores on the an- 
chor test contributed to the stability of the Tucker and fre- 
quency estimation equipercentile equating results across 
conditions. Some basis for this hypothesis exists in the lit- 
erature. 
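
The link between anchor length and anchor reliability can be illustrated with the Spearman-Brown formula, which is not used in the study itself and is introduced here only as a worked example. The starting reliability of .70 for a 21-item anchor is a hypothetical value, and the formula assumes the added items are comparable to the existing ones.

```python
def spearman_brown(reliability, length_factor):
    """Projected reliability when a test is lengthened by length_factor
    using items comparable to those already in the test."""
    return length_factor * reliability / (1 + (length_factor - 1) * reliability)

rel_21 = 0.70  # hypothetical reliability of a 21-item anchor (not from the study)
rel_36 = spearman_brown(rel_21, 36 / 21)
print(f"{rel_36:.3f}")  # projected reliability of a 36-item anchor
```

Under these assumptions the longer anchor is projected to be noticeably more reliable, which is the direction of the argument made in the text.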

Klein and Kolen (1985) investigated the relationship 
between accuracy of equating results and length of anchor 
test using the Tucker model. These researchers concluded
that, "When the tests being equated were very similar, or in 
this particular case, identical, and the groups of examinees 
very similar, substantially more accurate equating was not 
obtained by lengthening the anchor. However, longer an- 
chors did result in more accurate equating when the groups 
of examinees were dissimilar" (p. 10). An important ques-
tion is whether or not the Tucker and frequency estimation
equating results would have been more stable across the 
fall-spring random sampling combinations for the other four 
tests if the anchor test had been longer. 

One confusing aspect of the Biology results is that the 
Biology Test common-item delta plot for the fall-spring ran-
dom groups combination demonstrated the greatest scatter
(see Figure 14) and the lowest correlation (.73) displayed
by any of the five tests. One would expect that the poor
relationship between the common items might have had a
serious effect on all the equating procedures, not just the
Levine equally and unequally reliable and the chained equi-
percentile procedures. The differential scatter in the equated
delta plots observed for the fall-fall random, fall-spring ran- 
dom, and fall-spring matched (common items) samples con- 
tains important information for all the tests. What is most 
important to note is that only for the Biology Test, and to a 
much lesser extent the Chemistry Test, is the scatter more 
pronounced for the fall-spring random samples than for the 
fall-fall random samples. For the remaining three tests, the 
delta plots and accompanying correlation coefficients seem 
unaffected by differences in group ability. Also note that 
matching on distributions of scores on the set of common 
items had no effect on the scatter observed in the delta plots 
for any of the tests. 
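
The correlations reported for the delta plots can be illustrated as follows. The sketch assumes the conventional definition of the delta index (13 minus 4 times the normal deviate of the proportion answering correctly); the function names and any inputs are hypothetical, and the study's own step of equating the deltas is omitted. If that equating is a linear transformation, it leaves the correlation unchanged.

```python
from statistics import NormalDist, fmean

def delta(p_correct):
    """Delta index for an item, assuming delta = 13 - 4z, where z is the
    normal deviate of the proportion answering correctly.
    Harder items receive larger deltas."""
    return 13.0 - 4.0 * NormalDist().inv_cdf(p_correct)

def pearson(a, b):
    """Plain Pearson correlation of two equal-length sequences."""
    ma, mb = fmean(a), fmean(b)
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return sab / (saa * sbb) ** 0.5

def delta_plot_correlation(p_new, p_old):
    """Correlation between common-item deltas computed in two groups."""
    return pearson([delta(p) for p in p_new], [delta(p) for p in p_old])
```

A low correlation between the two sets of deltas indicates that the common items are ordered differently by difficulty in the two groups, which is the group-by-item interaction discussed in the text.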

Angoff (1987) suggests that researchers should differ- 
entiate between groups that differ in ability level and groups 
that differ in patterns of performance on anchor-test items. 
He describes a situation in which, if groups were selected 
". . . in such a way as to guarantee nearly equal overall
means and standard deviations in the two groups, there is a 
likelihood that a significant interaction between item per- 
formance and group would still [be] found" (p. 293). He 
hypothesizes that this is due to the fact that the common 
items measure different psychological or educational traits 
for the two groups. 

Following Angoff's reasoning, it could be possible that 
the fall-spring ability differences reflected in the common 
item scores do not have the same implications for all five 
tests. For the Biology Test, and to a lesser extent the Chem- 
istry Test, these differences could indicate that the fall- 
spring groups differ not only in the ability measured by the 
tests, but in other fundamental educational traits. On the 
other hand, the groups taking the remaining three tests 
simply represent groups that are at different points on a 
linear continuum of ability.

Another interesting question is why matching on com- 
mon items had such a positive effect on the Levine equally 
and unequally reliable results for the Biology Test and for 
the American History and Social Studies Test and such a 
negative effect on the Levine results for the other Achieve- 
ment Tests. 

Petersen, Cook, and Stocking (1983) compared equat-
ing results from Tucker, Levine equally reliable, Levine un-
equally reliable, and IRT procedures applied in an anchor-
test design used to equate the SAT. These researchers con-
cluded that, of the conventional procedures evaluated, the
most stable equating results were provided by the Levine
equally reliable model, and the least stable results were pro-
vided by the Tucker model. Petersen, et al., explained their
results as follows: "Implicit to the derivation of the Tucker
model is the assumption of random groups (Angoff 1984;
Levine 1955). Because the samples for the test editions to
be equated were not random samples from the same admin-
istration, and in several instances, differed considerably
in ability level, it is not surprising that the Levine models
gave more satisfactory results than the Tucker model" (p.
152). The explanation provided by Petersen, et al., is appro-
priate as an explanation of the Tucker and Levine equally
reliable results obtained in the current study for the Mathe-
matics Level II, Chemistry, and French Tests. It does not,
however, provide an explanation for the results obtained
for the Biology and American History and Social Studies
Tests. 
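
To make the role of the anchor in the Tucker model concrete, the following is a minimal sketch of Tucker linear equating for an anchor-test design, written from the standard textbook formulation rather than from the operational procedures used in this study. The function name, the argument names, and the default synthetic-population weights (proportional to sample size) are assumptions.

```python
from statistics import fmean, pvariance

def _cov(a, b):
    """Population covariance of two equal-length score lists."""
    ma, mb = fmean(a), fmean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)

def tucker_linear(x1, v1, y2, v2, w1=None):
    """Slope and intercept of the Tucker linear function that places
    new-form scores X on the old-form scale Y.

    x1, v1: total and anchor scores in group 1 (took the new form).
    y2, v2: total and anchor scores in group 2 (took the old form).
    w1: weight of group 1 in the synthetic population (assumed default:
        proportional to sample size).
    """
    if w1 is None:
        w1 = len(x1) / (len(x1) + len(y2))
    w2 = 1.0 - w1
    g1 = _cov(x1, v1) / pvariance(v1)     # regression slope of X on V, group 1
    g2 = _cov(y2, v2) / pvariance(v2)     # regression slope of Y on V, group 2
    dmu = fmean(v1) - fmean(v2)           # anchor mean difference
    dvar = pvariance(v1) - pvariance(v2)  # anchor variance difference
    mu_x = fmean(x1) - w2 * g1 * dmu
    mu_y = fmean(y2) + w1 * g2 * dmu
    var_x = pvariance(x1) - w2 * g1**2 * dvar + w1 * w2 * g1**2 * dmu**2
    var_y = pvariance(y2) + w1 * g2**2 * dvar + w1 * w2 * g2**2 * dmu**2
    slope = (var_y / var_x) ** 0.5
    return slope, mu_y - slope * mu_x
```

In this formulation the whole adjustment for group differences passes through the regression slopes on the anchor, so when the groups differ substantially in ability the result depends heavily on how well those regressions hold in both populations.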

The results obtained in the current study for the Chem- 
istry Test were somewhat perplexing. It was expected that 
equating results obtained for this test would be similar to 
those obtained for the Biology Test; however, this similarity 
was not observed. For the Chemistry Test, the Tucker and
frequency estimation equating scaled-score means observed 
for the fall-spring random samples were very different from 
those observed for the criterion equating. What makes this 
observation even more surprising is that the differences in 
ability level observed for the Chemistry Test fall-spring ran- 
dom groups (as evidenced by scores on the common-item 
set) were the smallest of any of the five tests included in this 
study. 



Conclusions 

The results of the current study should be considered tenta- 
tive for several reasons. For one, the criterion used to eval- 
uate the experimental equatings was a very pragmatic one. 
The criterion for each test in the study was based on the 
sampling plan that is currently used to provide reported 
scores for the particular Achievement Test investigated. The 
question explored was: what would be the effect on reported 
scores of changing the currently used sampling plan? And, 
given that a serious effect was observed, could this effect be 
ameliorated by matching new- and old-form samples on 
ability level using distributions of scores on a set of com- 
mon items or on some other covariate, such as responses to 
a background questionnaire given at the time of testing or 
when an examinee first registers to take a test? Use of a 
different equating criterion might possibly have led to a dif- 
ferent interpretation of the results of the study. However, the 
chosen criterion has practical value for the specific ques- 
tions investigated. A second limitation of this study is a lack 
of replication. Had different test forms or samples been cho- 
sen for this study, the results of the study might have dif- 
fered. 

In spite of the caveats mentioned above, several con-
clusions can be drawn from the results of this study. The
most important conclusion appears to be that it may be dif-
ficult, and in some cases impossible, to equate Achievement
Tests using new- and old-form samples obtained from dif-
ferent populations of examinees. It appears as though all
equating methods are affected by these differences to some
extent. In general, the equating models that appear to be the
most affected by differences in group ability are the Tucker 
and frequency estimation equipercentile models. The mod- 
els that appear to be the most robust to differences in group 
ability are the chained equipercentile and, in some in- 
stances, the Levine models. 

Matching on the basis of observed scores on a set of 
internal common items does not remedy the situation. In 
general, matching produces results for all equating proce- 
dures that are similar but that over- or underestimate scaled 
scores. Because the results are similar across methods, the 
effect can be quite misleading in that, in the absence of a 
criterion, one could conclude that because consistent results 
are obtained across methods, the results are close to "truth."
This was not found to be the case for the situations investi- 
gated in this study. 

To summarize, the results of the study cast serious 
doubt on the viability of an anchor-test data collection de- 
sign when samples that differ in ability are used to equate 
Achievement Tests of the type investigated in this study. It 
appears as though the equating models evaluated in this 
study are seriously affected by differences in group ability, 
and that this effect occurs even when there is little group- 
by-item interaction displayed by the common-item set. Fur- 
ther, attempts to ameliorate the effect of group differences 
by matching on a covariate such as the common-item set 
itself were quite unsuccessful, and matching is not recom- 
mended as a procedure for rectifying the problem of group 
ability-level differences when using an anchor-test equating
design to equate Achievement Tests such as those used in 
this study. 

An important consideration for future study is what, if 
any, effect does equating tests based on new- and old-form 
samples selected randomly from spring administrations (as 
compared with fall administrations) have on equated scores.
The question remains whether or not it would be a prudent 
and appropriate policy to introduce Achievement Tests at 
multiple time points in the school year as long as new- and 
old-form samples are both randomly selected from, say, two 
December or two June administrations. It is possible that 
introduction of new forms should be restricted to only fall
or only spring administrations. Further study to address this
question is recommended. 
Figure 1. Distributions of scores and percentages below on the 36 common items used in matching the Biology Test
spring old-form group to the fall new-form group.
For each of the subject areas in questions 12 through 17,
blacken the latest year-end or midyear grade you received
since beginning the ninth grade. For example, if you are a
senior and have not taken biology or any other biological
science since your sophomore year, indicate that year-end
grade. If you are a junior and have completed the first half
of the year in an English course, indicate that midyear
grade.

If you received the grade in an advanced, accelerated, or
honors course, also blacken the letter H.

(A) Excellent (usually 90-100 or A)

(B) Good (usually 80-89 or B)

(C) Fair (usually 70-79 or C)

(D) Passing (usually 60-69 or D)

(F) Failing (usually 59 or below or F)

(G) Only "pass-fail" grades were assigned and I received a
pass.

(H) The grade reported was in an advanced, accelerated, or
honors course.

12. English

13. Mathematics

14. Foreign Languages

15. Biological Sciences

16. Physical Sciences

17. Social Studies

Original Response    Recoded Numerical Response
F                    1
D                    2
C & G                3
B                    4
A                    5
H                    If H is marked, advance student's recoded numerical response by one

Correlation of question 15 responses with Biology Test scores
Fall new form        .62
Spring old form      .45

Figure 2. Biology SDQ question 15 and specifications for the recoding of re-
sponses for cross-tabulation and matching purposes; correlations of recoded re-
sponses with Biology Achievement Test scores for fall new-form and spring old-
form groups.
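
The recoding rule specified in Figure 2 can be written out as a small function. This is an illustrative sketch: the name `recode_grade` and the representation of the H response as a boolean flag are assumptions, since the source does not state how the responses were stored.

```python
# Base recoding from Figure 2; C and G share the same numerical code.
BASE_CODES = {"F": 1, "D": 2, "C": 3, "G": 3, "B": 4, "A": 5}

def recode_grade(letter, honors=False):
    """Recode an SDQ grade response (questions 12 through 17).

    letter: the grade response, one of F, D, C, G, B, A.
    honors: True if the examinee also marked H (assumed encoding).
    """
    code = BASE_CODES[letter]
    # Marking H advances the recoded response by one.
    return code + 1 if honors else code
```

With this rule the recoded variable runs from 1 to 6, which matches the six categories of recoded question 15 used in the cross-tabulations.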
Questions 47 through 60 concern how you feel you com-
pare with other people your own age in certain areas of
ability. For each field, blacken the letter

(A) if you feel you are in the highest 1 percent in that area of
ability

(B) if you feel you are in the highest 10 percent in that area
of ability

(C) if you feel you are above average in that area of ability

(D) if you feel you are average in that area of ability

(E) if you feel you are below average in that area of ability

47. Acting ability

48. Artistic ability

49. Athletic ability

50. Creative writing

51. Getting along with others

52. Leadership ability

53. Mathematical ability

54. Mechanical ability

55. Musical ability

56. Organizing work

57. Sales ability

58. Scientific ability

59. Spoken expression

60. Written expression

Original Response    Recoded Numerical Response
E                    1
D                    2
C                    3
B                    4
A                    5

Correlation of question 58 responses with Biology Test scores
Fall new form        .40
Spring old form      .[illegible]3

Figure 3. Biology SDQ question 58 and specifications for the recoding of re-
sponses for cross-tabulation and matching purposes; correlations of recoded re-
sponses with Biology Achievement Test scores for fall new-form and spring old-
form groups.
- Biology -

[Cross-tabulation panels of recoded question 15 by recoded question 58 (columns 1 through 5), one panel per group; cell frequencies illegible.]

Fall new form group:            total = 1,348; frequencies missing = 1,058; percent missing = 44%
Spring old form total group:    total = 11,626; frequencies missing = 11,543; percent missing = 49%
Spring old form matched group:  total = 1,348

Figure 4. Cross-tabulations of recoded responses to SDQ questions 15 and 58 used in matching
the Biology spring old-form group to the fall new-form group.



- Mathematics Level II -

            Fall new form           Spring old form           Spring old form
                group                 total group              matched group
Score  Frequency  Percent Below  Frequency  Percent Below  Frequency  Percent Below
   19         24           99.0         42           97.9         15             99
   18         42           97.1         95           93.2         26             97
   17         44           95.2        106           87.9         28             95
   16         58           92.7         86           83.6         37             92
   15        101           88.3        139           76.7         64             88
   14        131           82.6        145           69.5         83             82
   13        182           74.7        159           61.6        115             74
   12        196           66.2        184           52.4        123             66
   11        186           58.2        167           44.1        117             57
   10        226           48.4        173           35.5        142             48
    9        213           39.1        148           28.1        134             38
    8        201           30.4        133           21.5        127             29
    7        162           23.3        104           16.3        104             22
    6        135           17.5         84           12.1         84             16
    5        122           12.2         76            8.4         76             11
    4         90            8.3         57            5.5         57              7
    3         67            5.4         34            3.8         34              5
    2         54            3.0         37            2.0         37              2
    1         22            2.1         18            1.1         16              1
    0         30            0.8         13            0.4         13              0
   -1         14            0.2          3            0.3          3              0
   -2          3            0.0          4            0.1          4              0
   -3          1            0.0          1            0.0          1              0
   -4          9            0.0          1            0.0          1              0
Total      2,304                      2,009                      1,443

Figure 5. Distributions of scores and percentages below on the 19 common items used in matching the Mathematics
Level II Test spring old-form group to the fall new-form group.
- American History -

            Fall new form           Spring old form           Spring old form
                group                 total group              matched group
Score  Frequency  Percent Below  Frequency  Percent Below  Frequency  Percent Below
   20          3           99.9         20           99.0          2           99.9
   19          6           99.6         43           96.9          4           99.5
   18         21           98.6         68           93.5         14           98.5
   17         15           97.8         37           81.7         10           97.7
   16         49           95.5         96           87.0         33           95.3
   15         71           82.1        112           81.5         47           91.7
   14        121           86.2        171           73.1         61           85.6
   13        119           80.5        145           65.9         79           79.7
   12         97           75.8        142           58.9         65           74.0
   11        157           68.3        161           31.0        105           66.9
   10        152           61.0        181           42.1        101           59.3
    9        197           51.5        172 33.[illegible]        131           49.4
    8        204           41.7        146           26.4        136           39.2
    7        162           33.9        108           21.1        108           31.1
    6[illegible]           26.2        120           15.2        106           23.1
    5        170           18.0        103           10.1        103           15.4
    4        133           11.6         76            6.4         76            9.6
    3        106            6.5         46            4.1         46            6.2
    2         64            3.5         35            2.4         35            3.5
    1         35            1.8         24            1.2         23            1.8
    0         22            0.7         14            0.5         14            0.8
   -1          6            0.4          7            0.2          6            0.3
   -2          6            0.1          2            0.1          2            0.2
   -3          1            0.1          1            0.0          1            0.1
   -4          2            0.0          1            0.0          1            0.0
Total      2,078                      2,031                      1,328

Figure 6. Distributions of scores and percentages below on the 20 common items used in matching the American
History and Social Studies Test spring old-form group to the fall new-form group.
- Chemistry -

            Fall new form           Spring old form           Spring old form
                group                 total group              matched group
Score  Frequency  Percent Below  Frequency  Percent Below  Frequency  Percent Below
   20         13           99.4         22           99.0          9    [illegible]
   19         44           97.2         69           95.9         32    [illegible]
   18         56           94.4        111           90.8         40           94.4
   17[illegible]           92.3         40           89.0         31           92.2
   16  [missing]           87.5        115           83.8         69           87.4
   15        126           81.3        167           76.2         91           81.1
   14        131           74.8        149           69.5         94           74.5
   13[illegible]           67.1        192           60.8        112           66.7
   12        128           60.7        147           54.1         92           60.3
   11        141           53.7        202           45.0        102           53.2
   10        194           44.1        202           35.8        140           43.5
    9        182           35.1        162           28.5        131           34.3
    8        138           28.3        182           20.2         99           27.4
    7        112           22.7        113           15.1         81           21.8
    6        123           16.5        101           10.5         90           15.5
    5[illegible]           10.6         80            6.9         80           10.0
    4         88            6.2         63            4.0         63            5.6
    3         59            3.3         41            2.2         41            2.7
    2         30            1.8         16            1.4         18            1.5
    1         15            1.1         20            0.5         11            0.7
    0         16      [missing]          7            0.1          7            0.2
   -1          5            0.1          3            0.0          3            0.0
   -2          1            0.0          0            0.0          0            0.0
Total      2,017                      2,206                      1,436

Figure 7. Distributions of scores and percentages below on the 20 common items used in matching the Chemistry
Test spring old-form group to the fall new-form group.
            Fall new form           Spring old form           Spring old form
                group                 total group              matched group
Score  Frequency  Percent Below  Frequency  Percent Below  Frequency  Percent Below
   21         14           99.6         39           99.4          7           99.8
   20         44           99.1         93           97.6         23           99.1
   19          3           99.0         19           97.6          2           99.0
   18         64           97.6        186           94.6         45           97.6
   17        172           94.6        252           90.4         91           94.6
   16        253           90.7        333           65.0        134           90.7
   15         71           89.5         86           63.5         38           69.5
   14        303           64.6        343           77.0        161           84.6
   13        420           77.7        450           70.5        223           77.7
   12        451           70.4        514           62.0        239           70.4
   11        195           67.2        207           56.6        103           67.2
   10        464           59.3        466           50.6        257           59.3
    9        526           50.7        528           42.0        260           50.6
    8        609           40.7        542           33.0        323           40.7
    7        326           35.4        249           26.9        174 3[illegible].3
    6        476           27.6        395           22.4        253           27.6
    5        469           19.9        404           15.6        249           19.9
    4        396           13.4        315           10.6        211           13.4
    3        208           10.0        145            6.2        110           10.0
    2        230            6.3        212            4.7        122            6.3
    1        172            3.4        155            2.2         91            3.4
    0        114            1.6         60            1.2         60            1.6
   -1         40            0.9         27            0.6         21            1.0
   -2         39            0.3         26            0.3         21            0.3
   -3         11            0.1         13            0.1          6            0.1
   -4          7            0.0          5            0.0          4            0.0
   -5          0            0.0[illegible]            0.0          0            0.0
Total      6,125                      6,086                      3,246

Figure 8. Distributions of scores and percentages below on the 21 common items used in matching the French Test
spring old-form group to the fall new-form group.
Questions 6 through 11 ask you to blacken the letter corre-
sponding to the total years of study you expect to complete
in certain subject areas. Include in the total only courses
you have taken since beginning the ninth grade and those
you expect to complete before graduation from high
school. Count less than a full year in a subject as a full
year. Do not count a repeated year of the same course as
an additional year of study.

(A) One year or the equivalent

(B) Two years or the equivalent

(C) Three years or the equivalent

(D) Four years or the equivalent

(E) More than four years or the equivalent

(F) I will not take any courses in the subject area.

6. English

7. Mathematics

8. Foreign Languages

9. Biological Sciences (for example, biology, botany, or zoology)

10. Physical Sciences (for example, chemistry, physics, or earth
science)

11. Social Studies (for example, history, government, or geogra-
phy)

Original Response    Recoded Numerical Response
F                    1
A                    2
B                    3
C                    4
D                    5
E                    6

Correlation of question 8 responses with French Test scores
Fall new form        .30
Spring old form      .32

Figure 9. French SDQ question 8 and specifications for the recoding of responses
for cross-tabulation and matching purposes; correlations of recoded responses with
French Achievement Test scores for fall new-form and spring old-form groups.



For each of the subject areas in questions 12 through 17,
blacken the latest year-end or midyear grade you received
since beginning the ninth grade. For example, if you are a
senior and have not taken biology or any other biological
science since your sophomore year, indicate that year-end
grade. If you are a junior and have completed the first half
of the year in an English course, indicate that midyear
grade.

If you received the grade in an advanced, accelerated, or
honors course, also blacken the letter H.

(A) Excellent (usually 90-100 or A)

(B) Good (usually 80-89 or B)

(C) Fair (usually 70-79 or C)

(D) Passing (usually 60-69 or D)

(F) Failing (usually 59 or below or F)

(G) Only "pass-fail" grades were assigned and I received a
pass.

(H) The grade reported was in an advanced, accelerated, or
honors course.

12. English

13. Mathematics

14. Foreign Languages

15. Biological Sciences

16. Physical Sciences

17. Social Studies

Original Response    Recoded Numerical Response
F                    1
D                    2
C & G                3
B                    4
A                    5
H                    If H is marked, advance student's recoded numerical response by one

Correlation of question 14 responses with French Test scores
Fall new form        .27
Spring old form      .38

Figure 10. French SDQ question 14 and specifications for the recoding of re-
sponses for cross-tabulation and matching purposes; correlations of recoded re-
sponses with French Achievement Test scores for fall new-form and spring old-
form groups.
Fall new form group

                                       Recoded Question 14
Recoded Question 8             1           2           3           4           5           6
        1                      0           0           0           2           2           1
        2                      0           0           0           2           6           3
        3                      0           1           3          21          32           2
        4                      0           0          25         169         247          60
        5                      0           1          55         404         820         280
        6                      0           1          12         154         450         208

Total = 2,961; frequencies missing = 3,164; percent missing = 51.7

Spring old form total group

                                       Recoded Question 14
Recoded Question 8             1           2           3           4           5           6
        1                      0           0           0           3           0           1
        2            [illegible]           0           2           2          12           3
        3                      0           2          17          33          56           6
        4            [illegible]          11          68         295         362         114
        5                      0           5          69         559        1032         507
        6                      0           2          30         195         504         374

Total = 4,286; frequencies missing = 1,792; percent missing = 29.5

Spring old form matched group

                                       Recoded Question 14
Recoded Question 8             1           2           3           4           5           6
        1                      0           0           0           2           0           1
        2                      0           0           0           2           6           3
        3                      0           1           3          21          32           2
        4                      0           0          25         169         247          60
        5                      0           1          55         404         820         280
        6                      0           1          12         154         450         208

Total = 2,959

Figure 11. Cross-tabulations of recoded responses to SDQ questions 8 and 14 used in matching the French
spring old-form group to the fall new-form group.
In the group of nine ovals labeled Q, you are to fill in ONE and
ONLY ONE oval, as described below, to indicate how you obtained
your knowledge of French. The information that you provide is for
statistical purposes only and will not influence your score on the
test.

If your knowledge of French does
not come primarily from courses
taken in grades 9 through 12,
fill in oval 9 and leave the
remaining ovals blank, regard-
less of how long you studied the
subject in school. For example,
you are to fill in oval 9 if
your knowledge of French comes
primarily from any of the
following sources: study
prior to the ninth grade, courses
taken at a college, special
study, residence abroad, or
living in a home in which French
is spoken.

If your knowledge of French does
come primarily from courses
taken in grades 9 through 12,
fill in the oval that indicates
the level of the French course in
which you are currently enrolled.
If you are not now enrolled in a
French course, fill in the oval
that indicates the level of the
most advanced course in French
that you have completed.

Level I:     first or second half    - fill in oval 1
Level II:    first half              - fill in oval 2
             second half             - fill in oval 3
Level III:   first half              - fill in oval 4
             second half             - fill in oval 5
Level IV:    first half              - fill in oval 6
             second half             - fill in oval 7
Advanced Placement or course
that represents a level of
study higher than Level IV:
             second half             - fill in oval 8

If you are in doubt about whether to mark oval 9 rather than one
of the ovals 1-8, mark oval 9.

Figure 12. Question on Background Questionnaire in French used in matching the French spring
old-form group to the fall new-form group.
- French -

             Fall new form           Spring old form           Spring old form
  Oval           group                 total group              matched group
filled  Frequency  Percent Below  Frequency  Percent Below  Frequency  Percent Below
     9          0          100.0          0          100.0          0          100.0
     8       1632           73.4        550           91.0        247           73.4
     7        731           61.4       2283           53.4        111           61.4
     6       2271           24.3        344           47.7        344           24.3
     5        710           12.8       2242           10.8        108           12.7
     4        596            3.0        298  5.[illegible]         90            3.0
     3        165            0.0        361            0.0         26            0.0
     2          0            0.0          0            0.0          0            0.0
     1          0            0.0          0            0.0          0            0.0
 Total      6,125                      6,078                        928

Figure 13. Distributions of responses to question on Background Questionnaire in French used in matching the
French spring old-form group to the fall new-form group.
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r = *77 
fall/spring matched groups 
(SDQ) 



Figure 14. Plots of Biolog>' fall new«*form versus fall old-form equated deltas for the one fall-fall sample combination 
and versus spring old-form equated deltas for the three fall-spring sample combinations. 



- Mathematics Level II 




r = .99 r = .98 

fall /fall rnndom groups fall/sprinR random groups 
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r = .98 
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Figure 15. Plots of Mathematics Level II fall new-form versus fall old-form equated deltas for the one fall-fall sample 
combination and versus spring old-form equated deltas for the two fall-spring sample combinations. 
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- American History - 
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Figure 16. Plots of American History and Social Studies fall nen^-form versus fall old-form equated deltas for the one 
fall-fall sample combination and versus spring old-form equated deltas for the two fall-spring sample combinations. 
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- Chemistry - 
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EQUATED DELTA 
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fall/fcill random groups 
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Figure 17. Plots of Chemistry fall new-form versus fall old-form equated deltas for the one fall-fall sample combina- 
tion and versus spring old-form equated deltas for the two fall-spring sample combinations. 
Figure 18. Plots of French fall new-form versus fall old-form equated deltas for the one fall-fall sample combination 
and versus spring old-form equated deltas for the four fall-spring sample combinations. 
[Figure 19 plot panels; axis label: RAW SCORE]

(a) Fall-spring random or matched groups Tucker equatings minus fall-fall random groups Tucker criterion equating
(b) Fall-spring random or matched groups Levine equally reliable equatings minus fall-fall random groups Levine 
equally reliable criterion equating
(c) Fall-spring random or matched groups Levine unequally reliable equatings minus fall-fall random groups Levine 
unequally reliable criterion equating
(e) Fall-spring random or matched groups frequency estimation equipercentile equatings minus fall-fall random groups 
frequency estimation equipercentile criterion equating

Abbreviations used in plots: fall-fall random groups: F/F-R; fall-spring random groups: F/S-R; fall-spring matched groups 
(common items): F/S-MC; fall-spring matched groups (SDQ): F/S-MS; Tucker: T; Levine equally reliable: LE; Levine 
unequally reliable: LU; chained equipercentile: CE; frequency estimation equipercentile: FE 

Figure 19. Biology equating-difference plots (fall-spring random or matched groups equatings minus fall-fall random 
groups criterion equating). 

[Figure 20 plot panel: fall-spring matched (SDQ) equatings minus fall-fall random groups Tucker criterion equating]

Abbreviations used in plots: fall-fall random groups: F/F-R; fall-spring random groups: F/S-R; fall-spring matched groups 
(common items): F/S-MC; fall-spring matched groups (SDQ): F/S-MS; Tucker: T; Levine equally reliable: LE; Levine 
unequally reliable: LU; chained equipercentile: CE; frequency estimation equipercentile: FE 

Figure 20. Biology equating-difference plots (fall-fall random and fall-spring random or matched groups equatings 
minus fall-fall random groups Tucker criterion equating). 

Figure 21. Plot of Biology projected new-form scaled-score means for all equating-method and equating-sample 
combinations. 
[Figure 22 plot panel]

(e) Fall-spring random or matched groups frequency estimation equipercentile equatings minus fall-fall random groups 
frequency estimation equipercentile criterion equating

Abbreviations used in plots: fall-fall random groups: F/F-R; fall-spring random groups: F/S-R; fall-spring matched groups 
(common items): F/S-MC; Tucker: T; Levine equally reliable: LE; Levine unequally reliable: LU; chained equipercentile: CE; 
frequency estimation equipercentile: FE 

Figure 22. Mathematics Level II equating-difference plots (fall-spring random or matched groups equatings minus 
fall-fall random groups criterion equating). 

[Figure 23 plot panel: fall-spring matched (common items) equatings minus fall-fall random groups Tucker criterion equating]

Abbreviations used in plots: fall-fall random groups: F/F-R; fall-spring random groups: F/S-R; fall-spring matched groups 
(common items): F/S-MC; Tucker: T; Levine equally reliable: LE; Levine unequally reliable: LU; chained equipercentile: CE; 
frequency estimation equipercentile: FE 

Figure 23. Mathematics Level II equating-difference plots (fall-fall random and fall-spring random or matched groups 
equatings minus fall-fall random groups Tucker criterion equating). 

Figure 24. Plot of Mathematics Level II projected new-form scaled-score means for all equating-method and 
equating-sample combinations. 
[Figure 25 plot panels recoverable from the plots]

Fall-spring random or matched groups Levine unequally reliable equatings minus fall-fall random groups Levine 
unequally reliable criterion equating
Fall-spring random or matched groups chained equipercentile equatings minus fall-fall random groups chained 
equipercentile criterion equating
(e) Fall-spring random or matched groups frequency estimation equipercentile equatings minus fall-fall random groups 
frequency estimation equipercentile criterion equating

Abbreviations used in plots: fall-fall random groups: F/F-R; fall-spring random groups: F/S-R; fall-spring matched groups 
(common items): F/S-MC; Tucker: T; Levine equally reliable: LE; Levine unequally reliable: LU; chained equipercentile: CE; 
frequency estimation equipercentile: FE 

Figure 25. American History and Social Studies equating-difference plots (fall-spring random or matched groups 
equatings minus fall-fall random groups criterion equating). 

[Figure 26 plot panel]

(c) Fall-spring matched (common items) equatings minus fall-fall random groups Tucker criterion equating

Abbreviations used in plots: fall-fall random groups: F/F-R; fall-spring random groups: F/S-R; fall-spring matched groups 
(common items): F/S-MC; Tucker: T; Levine equally reliable: LE; Levine unequally reliable: LU; chained equipercentile: CE; 
frequency estimation equipercentile: FE 

Figure 26. American History and Social Studies equating-difference plots (fall-fall random and fall-spring random or 
matched groups equatings minus fall-fall random groups Tucker criterion equating). 

Figure 27. Plot of American History and Social Studies projected new-form scaled-score means for all equating-method 
and equating-sample combinations. 
Abbreviations used in plots: fall-fall random groups: F/F-R; fall-spring random groups: F/S-R; fall-spring matched groups 
(common items): F/S-MC; Tucker: T; Levine equally reliable: LE; Levine unequally reliable: LU; chained equipercentile: CE; 
frequency estimation equipercentile: FE 

Figure 28. Chemistry equating-difference plots (fall-spring random or matched groups equatings minus fall-fall random 
groups criterion equating). 

[Figure 29 plot panel]

(c) Fall-spring matched (common items) equatings minus fall-fall random groups Tucker criterion equating

Abbreviations used in plots: fall-fall random groups: F/F-R; fall-spring random groups: F/S-R; fall-spring matched groups 
(common items): F/S-MC; Tucker: T; Levine equally reliable: LE; Levine unequally reliable: LU; chained equipercentile: CE; 
frequency estimation equipercentile: FE 

Figure 29. Chemistry equating-difference plots (fall-fall random and fall-spring random or matched groups equatings 
minus fall-fall random groups Tucker criterion equating). 

Figure 30. Plot of Chemistry projected new-form scaled-score means for all equating-method and equating-sample 
combinations. 
Abbreviations used in plots: fall-fall random groups: F/F-R; fall-spring random groups: F/S-R; fall-spring matched groups 
(common items): F/S-MC; fall-spring matched groups (SDQ): F/S-MS; fall-spring matched groups (BQ): F/S-MB; Tucker: T; 
Levine equally reliable: LE; Levine unequally reliable: LU; chained equipercentile: CE; frequency estimation 
equipercentile: FE 

Figure 31. French equating-difference plots (fall-spring random or matched groups equatings minus fall-fall random 
groups criterion equating). 

Abbreviations used in plots: fall-fall random groups: F/F-R; fall-spring random groups: F/S-R; fall-spring matched groups 
(common items): F/S-MC; fall-spring matched groups (SDQ): F/S-MS; fall-spring matched groups (BQ): F/S-MB; Tucker: T; 
Levine equally reliable: LE; Levine unequally reliable: LU; chained equipercentile: CE; frequency estimation 
equipercentile: FE 

Figure 32. French equating-difference plots (fall-fall random and fall-spring random or matched groups equatings 
minus fall-fall random groups Tucker criterion equating). 

Figure 33. Plot of French projected new-form scaled-score means for all equating-method and equating-sample 
combinations. 
Table 1. Numbers of Items in New and Old Forms and in Common-Item Sets for Tests in Study 

                                              New Form                     Old Form              Common-Item Set
Test                                   No. of Items  Administered   No. of Items  Administered     No. of Items
Biology                                     99           fall            95       fall, spring          36
Mathematics Level II                        50           fall            50       fall, spring          19
American History and Social Studies         94           fall           100       fall, spring          20
Chemistry                                   90           fall            90       fall, spring          20
French                                      85           fall            85       fall, spring          21


Table 2a. Content and Skills Specifications for ATP Achievement Tests: Biology 

Topics Covered                                                Approximate Percentage of Test

Cellular and Molecular Biology                                              30
    Cell structure and organization, mitosis, photosynthesis,
    cellular respiration, enzymes, molecular genetics,
    biosynthesis, biological chemistry
Ecology                                                                     15
    Energy flow, nutrient cycles, populations, communities,
    ecosystems, biomes
Classical Genetics                                                          10
    Meiosis, Mendelian genetics, inheritance patterns
Organismal Structure and Function                                           30
    Anatomy and physiology, developmental biology, behavior
Evolution and Diversity                                                     15
    Origin of life, evidence of evolution, natural selection,
    speciation, patterns of evolution, classification and
    diversity of prokaryotes, protists, fungi, plants, and animals

Skills Specifications                                         Approximate Percentage of Test

Level I    Essentially Recall:                                              50
    remembering specific facts; demonstrating straightforward
    knowledge of information and familiarity with terminology
Level II   Essentially Application:                                         30
    understanding concepts and reformulating information into
    other equivalent terms; applying knowledge to unfamiliar
    and/or practical situations; solving problems using
    mathematical relationships
Level III  Essentially Interpretation:                                      20
    inferring and deducing from qualitative and quantitative
    data and integrating information to form conclusions;
    recognizing unstated assumptions

Table 2b. Content of ATP Achievement Tests: Mathematics Level II 

Topics Covered                         Approximate Percentage of Test

Algebra                                             18
Geometry                                             8
    Solid geometry
Coordinate geometry                                 12
Trigonometry                                        20
Functions                                           24
Miscellaneous                                       18


Table 2c. Content of ATP Achievement Tests: American History and Social Studies 

Material Covered                                              Approximate Percentage of Test

Political History                                                         30-34
Economic History                                                          16-18
Social History                                                            16-20
Intellectual and Cultural History                                          8-10
Foreign Policy                                                            11-15
Social Science Concepts, Methods, and Generalizations                     10-12

Periods Covered                                               Approximate Percentage of Test

Pre-Columbian History to 1789                                               18
1790 to 1898                                                                35
1899 to the Present                                                         35
Nonchronological                                                            12
Table 2d. Content and Skills Specifications for ATP Achievement Tests: Chemistry 

Topics Covered                                                Approximate Percentage of Test

I.    Atomic Theory and Structure                                           10
          Periodic relationships
II.   Nuclear Reactions                                                      2
III.  Chemical Bonding and Molecular Structure                              11
IV.   States of Matter and Kinetic Molecular Theory                          9
V.    Solutions                                                              6
          Concentration units, solubility, and colligative properties
VI.   Acids and Bases                                                        9
VII.  Oxidation-Reduction and Electrochemistry                               7
VIII. Stoichiometry                                                         11
          Mole concept, Avogadro's number, empirical and molecular
          formulas, percentage composition, stoichiometric
          calculations, and limiting reagents
IX.   Reaction Rates                                                         2
          Rate equations and factors affecting rates
X.    Equilibrium                                                      [illegible]
          Mass action expressions, ionic equilibria, and
          Le Chatelier's principle
XI.   Thermodynamics                                                   [illegible]
          Energy changes in chemical reactions, randomness, and
          criteria for spontaneity
XII.  Descriptive Chemistry                                                 16
          Physical and chemical properties of elements and their
          more familiar compounds, including simple examples from
          organic chemistry; periodic properties
XIII. Laboratory                                                             7
          Equipment, procedures, observations, safety, calculations,
          and interpretation of results

Skills Specifications                                         Approximate Percentage of Test

Level I    Essentially Recall:                                              30
    remembering information and understanding facts
Level II   Essentially Application:                                         55
    applying knowledge to unfamiliar and/or practical situations;
    solving problems using mathematical relationships
Level III  Essentially Interpretation:                                      15
    inferring and deducing from qualitative and quantitative data
    and integrating information to form conclusions

Note: Every edition contains approximately five questions on equation balancing and/or predicting products of 
chemical reactions. These are distributed among the various content categories. 


Table 2e. Skills Specifications for ATP Achievement Tests: French 

Skills Specifications                  Approximate Percentage of Test

Vocabulary in Context                               30
Structure                                           40
Reading Comprehension                               30


Table 3. Raw-Score Summary Statistics for Total Tests and the Common-Item Set for Samples Used 
in Equatings: Biology 

                                                         Total Test*                   Common-Item Set†
                                                          Standard            Mean as %    Standard  Common Item/Total
Sample                                        N     Mean  Deviation     Mean  of Maximum   Deviation   Test Correlation
Fall new form (random)                    2,408    46.33      18.26    17.52        48.7        7.52                .92
Fall old form (random)                    2,554    44.97      18.11    17.71        49.2        7.60                .92
Spring old form (random)                  2,504    54.15      17.55    21.35        60.7        7.23                .93
Spring old form (matched common items)    2,408    44.26      17.93    17.52        48.7        7.52                .93
Spring old form (matched SDQ)             1,348    54.17      17.76    22.14        61.5        7.32                .93

*The new-form total test consisted of 99 items, whereas the old-form test consisted of 95 items. 
†The common-item set consisted of 36 items. 
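
The "Mean as % of Maximum" column can be checked against the common-item means and the size of the common-item set. The following is a minimal sketch, assuming the column is simply the raw common-item mean divided by the number of common items; the function name is hypothetical and not part of the study.

```python
def mean_as_percent_of_maximum(mean_score: float, n_items: int) -> float:
    """Express a raw-score mean as a percentage of the maximum possible score.

    Assumption: the maximum score equals the number of items in the set.
    """
    if n_items <= 0:
        raise ValueError("n_items must be positive")
    return round(100.0 * mean_score / n_items, 1)


# Common-item means taken from Table 3 (Biology); the set has 36 items.
fall_new = mean_as_percent_of_maximum(17.52, 36)
matched_sdq = mean_as_percent_of_maximum(22.14, 36)
print(fall_new, matched_sdq)
```

The same calculation applies to Tables 4 through 7 with the common-item counts given in their footnotes.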

Table 4. Raw-Score Summary Statistics for Total Tests and the Common-Item Set for Samples Used 
in Equatings: Mathematics Level II 

                                                         Total Test*                   Common-Item Set†
                                                          Standard            Mean as %    Standard  Common Item/Total
Sample                                        N     Mean  Deviation     Mean  of Maximum   Deviation   Test Correlation
Fall new form (random)                    2,304    24.61       9.56     9.55        50.3        4.19                .92
Fall old form (random)                    2,346    26.25       9.71     9.73        51.2        4.28                .92
Spring old form (random)                  2,009    29.65      10.17    10.98        57.8        4.41                .93
Spring old form (matched common items)    1,443    26.63       9.56     9.60        50.5        4.15                .92

*The new- and old-form total tests consisted of 50 items. 
†The common-item set consisted of 19 items. 

Table 5. Raw-Score Summary Statistics for Total Tests and the Common-Item Set for Samples Used 
in Equatings: American History and Social Studies 

                                                         Total Test*                   Common-Item Set†
                                                          Standard            Mean as %    Standard  Common Item/Total
Sample                                        N     Mean  Deviation     Mean  of Maximum   Deviation   Test Correlation
Fall new form (random)                    2,078    39.64      16.14     8.49        42.5        4.15                .87
Fall old form (random)                    2,102    40.30      16.60     8.67        43.4        4.28                .86
Spring old form (random)                  2,031    46.93      17.92    10.38        51.9        4.48                .88
Spring old form (matched common items)    1,329    41.11      16.62     8.71        43.6        4.09                .86

*The new-form total test consisted of 94 items, whereas the old-form test consisted of 100 items. 
†The common-item set consisted of 20 items. 

Table 6. Raw-Score Summary Statistics for Total Tests and the Common-Item Set for Samples Used 
in Equatings: Chemistry 

                                                         Total Test*                   Common-Item Set†
                                                          Standard            Mean as %    Standard  Common Item/Total
Sample                                        N     Mean  Deviation     Mean  of Maximum   Deviation   Test Correlation
Fall new form (random)                    2,017    41.29      17.84    10.22        51.1        4.42                .88
Fall old form (random)                    2,249    36.36      18.91     9.94        49.7        4.61                .89
Spring old form (random)                  2,206    43.13      17.74    11.11        55.6        4.31                .87
Spring old form (matched common items)    1,436    40.43      17.54    10.31        51.6        4.36                .88

*The new- and old-form total tests consisted of 90 items. 
†The common-item set consisted of 20 items. 






Table 7. Raw-Score Summary Statistics for Total Tests and the Common-Item Set for Samples Used 
in Equatings: French 

                                                         Total Test*                   Common-Item Set†
                                                          Standard            Mean as %    Standard  Common Item/Total
Sample                                        N     Mean  Deviation     Mean  of Maximum   Deviation   Test Correlation
Fall new form (random)                    6,125  35.6[?]      14.90     8.60        41.0        4.66                .87
Fall old form (random)                    7,269    34.42      15.04     8.74        41.6        4.69                .88
Spring old form (random)                  6,078    38.38      16.18     9.62        45.8        4.91                .89
Spring old form (matched common items)    3,248    35.39      15.42     8.60        41.0        4.66                .88
Spring old form (matched SDQ)             2,959    38.37      16.04     9.63        45.9        4.90                .89
Spring old form (matched BQ)                928    41.05      16.12    10.34        49.2        4.95                .90

*The new- and old-form total tests consisted of 85 items. 
†The common-item set consisted of 21 items. 



Table 8. Correlation Coefficients for Item-Difficulty Estimates (Deltas) and New-Form Total-Test Summary 
Statistics for Different Sample Combinations: Biology 

                                                      Total-Test
                                                    Equated Delta†
Sample Combination                             r*     Mean     S.D.
Fall-fall random groups                       .99    13.09     1.94
Fall-spring random groups                     .73    13.09     1.94
Fall-spring matched groups (common items)     .79    13.08     1.94
Fall-spring matched groups (SDQ)              .77    13.09     1.94

*Correlation coefficients between the deltas obtained for the 36 common items when given to the new- and 
old-form samples indicated. 
†Statistical specifications call for an equated delta mean of 13.0 and standard deviation of 2.2. 



Table 9. Correlation Coefficients for Item-Difficulty Estimates (Deltas) and New-Form Total-Test Summary 
Statistics for Different Sample Combinations: Mathematics Level II 

                                                      Total-Test
                                                    Equated Delta†
Sample Combination                             r*     Mean     S.D.
Fall-fall random groups                       .99    15.52   2.4[?]
Fall-spring random groups                     .98    15.53     2.48
Fall-spring matched groups (common items)     .98    15.53     2.47

*Correlation coefficients between the deltas obtained for the 19 common items when given to the new- and 
old-form samples indicated. 
†Statistical specifications call for an equated delta mean of 15.2 and standard deviation of 2.2. 



Table 10. Correlation Coefficients for Item-Difficulty Estimates (Deltas) and New-Form Total-Test Summary 
Statistics for Different Sample Combinations: American History and Social Studies 

                                                      Total-Test
                                                    Equated Delta†
Sample Combination                             r*     Mean     S.D.
Fall-fall random groups                       .94    11.34     2.35
Fall-spring random groups                     .93    11.34     2.34
Fall-spring matched groups (common items)     .93    11.34     2.36

*Correlation coefficients between the deltas obtained for the 20 common items when given to the new- and 
old-form samples indicated. 
†Statistical specifications call for an equated delta mean of 11.5 and standard deviation of 2.2. 



Table 11. Correlation Coefficients for Item-Difficulty Estimates (Deltas) and New-Form Total-Test Summary 
Statistics for Different Sample Combinations: Chemistry 

                                                      Total-Test
                                                    Equated Delta†
Sample Combination                             r*     Mean     S.D.
Fall-fall random groups                       .97    13.14     1.84
Fall-spring random groups                     .94    13.15     1.82
Fall-spring matched groups (common items)     .94    13.16     1.85

*Correlation coefficients between the deltas obtained for the 20 common items when given to the new- and 
old-form samples indicated. 
†Statistical specifications call for an equated delta mean of 13.0 and standard deviation of 2.0. 



Table 12. Correlation Coefficients for Item-Difficulty Estimates (Deltas) and New-Form Total-Test Summary 
Statistics for Different Sample Combinations: French 

                                                      Total-Test
                                                    Equated Delta†
Sample Combination                             r*     Mean     S.D.
Fall-fall random groups                       .99    11.69     3.45
Fall-spring random groups                     .99    11.68     3.45
Fall-spring matched groups (common items)     .99    11.69     3.44
Fall-spring matched groups (SDQ)              .99    11.69     3.46
Fall-spring matched groups (BQ)               .98    11.58     3.65

*Correlation coefficients between the deltas obtained for the 21 common items when given to the new- and 
old-form samples indicated. 
†Statistical specifications call for an equated delta mean of 11.7 and standard deviation of 2.2. 



Table 13. New-Form Scaled-Score Summary Statistics Resulting from Equating-Method and Equating-Sample 
Combinations: Biology 

                                                                    Equating Method
                                                                    Levine              Levine                                  Frequency
                                                                    Equally             Unequally           Chained             Estimation
                                                Tucker              Reliable            Reliable            Equipercentile      Equipercentile
Samples                                           Mean      S.D.      Mean      S.D.      Mean      S.D.      Mean      S.D.      Mean      S.D.
Fall-fall random groups                         514.67    103.64    514.27    103.30    514.27    103.60    514.58    103.42    514.59    103.76
Fall-spring random groups                       514.13  10[?].82    504.15    106.00    504.45    105.11    509.79    104.23    514.05    105.08
Fall-spring matched groups (common items)       512.91    103.46    512.91    103.46    512.92    102.97    512.97    103.31    512.95    103.46
Fall-spring matched groups (SDQ)                512.93    105.22    501.37    105.94    501.70    104.77    508.15    102.73    513.20    104.41

Raw-score frequency distributions used to compute scaled-score summary data were obtained from fall new-form total group (N = 7,208). 
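
One way to read Table 13 is to subtract each method's fall-fall random groups mean (the criterion) from its fall-spring mean, following the "equatings minus criterion equating" convention of the equating-difference plots. The sketch below does this for the Biology means only; the dictionary layout and function name are illustrative assumptions, and differences in means are a coarser summary than the score-by-score differences shown in the figures.

```python
# New-form scaled-score means transcribed from Table 13 (Biology).
# Keys follow the plot abbreviations: T, LE, LU, CE, FE.
FALL_FALL_RANDOM = {"T": 514.67, "LE": 514.27, "LU": 514.27, "CE": 514.58, "FE": 514.59}
FALL_SPRING_RANDOM = {"T": 514.13, "LE": 504.15, "LU": 504.45, "CE": 509.79, "FE": 514.05}


def mean_differences(sample_means: dict, criterion_means: dict) -> dict:
    """Return sample mean minus criterion mean for each equating method."""
    return {
        method: round(sample_means[method] - criterion_means[method], 2)
        for method in criterion_means
    }


diffs = mean_differences(FALL_SPRING_RANDOM, FALL_FALL_RANDOM)
for method, diff in diffs.items():
    print(f"{method}: {diff:+.2f}")
```

The same function can be applied to the matched-groups rows, or to Tables 14 through 17, by substituting the corresponding means.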



Table 14. New-Form Scaled-Score Summary Statistics Resulting from Equating-Method and Equating-Sample 
Combinations: Mathematics Level II 

                                                                    Equating Method
                                                                    Levine              Levine                                  Frequency
                                                                    Equally             Unequally           Chained             Estimation
                                                Tucker              Reliable            Reliable            Equipercentile      Equipercentile
Samples                                           Mean      S.D.      Mean      S.D.      Mean      S.D.      Mean      S.D.      Mean      S.D.
Fall-fall random groups                         657.51     79.68    656.96     79.10    656.94     79.46    656.45     78.93    657.56     79.88
Fall-spring random groups                       663.58     81.39    659.53     80.10    659.58     79.83    661.86     80.73    663.57     81.86
Fall-spring matched groups (common items)       663.02     80.37    662.84     80.59    662.83     80.75    662.88     80.91    663.10     80.58

Raw-score frequency distributions used to compute scaled-score summary data were obtained from fall new-form total group (N = 13,825). 
Table 15. New-Form Scaled-Score Summary Statistics Resulting from Equating-Method and Equating-Sample 
Combinations: American History and Social Studies 

                                                                    Equating Method
                                                                    Levine              Levine                                  Frequency
                                                                    Equally             Unequally           Chained             Estimation
                                                Tucker              Reliable            Reliable            Equipercentile      Equipercentile
Samples                                           Mean      S.D.      Mean      S.D.      Mean      S.D.      Mean      S.D.      Mean      S.D.
Fall-fall random groups                         492.58     89.07    491.47     87.42    491.42     89.46    492.10     87.92    492.38     88.83
Fall-spring random groups                       495.66     92.81    485.50     89.13  485.[?]0     89.90    491.99     89.14    495.68     90.93
Fall-spring matched groups (common items)       496.18     92.28    494.76     93.05    494.75     93.23    495.95     92.42    496.34     92.04

Raw-score frequency distributions used to compute scaled-score summary data were obtained from fall new-form total group (N = 16,598). 



Table 16. New-Form Scaled-Score Summary Statistics Resulting from Equating-Method and Equating-Sample 
Combinations: Chemistry 

                                                                    Equating Method
                                                                    Levine              Levine                                  Frequency
                                                                    Equally             Unequally           Chained             Estimation
                                                Tucker              Reliable            Reliable            Equipercentile      Equipercentile
Samples                                           Mean      S.D.      Mean      S.D.      Mean      S.D.      Mean      S.D.      Mean      S.D.
Fall-fall random groups                         556.21     94.25    557.63     92.36    557.63     92.31  55[?].54     94.37    555.87     94.84
Fall-spring random groups                       569.48     93.04    564.25     94.36    564.24     94.44    567.65     92.60    569.63     92.37
Fall-spring matched groups (common items)       570.39     91.25    569.88     92.00    569.88     91.89    570.13     91.49    570.45     91.10

Raw-score frequency distributions used to compute scaled-score summary data were obtained from fall new-form total group (N = 8,059). 



Table 17. New-Form Scaled-Score Summary Statistics Resulting from Equating-Method and Equating-Sample 
Combinations: French 

                                                                    Equating Method
                                                                    Levine              Levine                                  Frequency
                                                                    Equally             Unequally           Chained             Estimation
                                                Tucker              Reliable            Reliable            Equipercentile      Equipercentile
Samples                                           Mean      S.D.      Mean      S.D.      Mean      S.D.      Mean      S.D.      Mean      S.D.
Fall-fall random groups                         534.00    103.59    533.26    103.29    533.25    102.84    533.92    103.22    534.52    104.05
Fall-spring random groups                       542.78    107.44    537.68    104.62    537.73    103.89    540.27    104.45  542.[?]4    106.50
Fall-spring matched groups (common items)       542.62    106.60    542.61    106.60    542.58    105.64    542.45    105.99    542.67    106.57
Fall-spring matched groups (SDQ)                542.54  10[?].68    537.50    104.06    537.52    103.11    540.04    103.57  542.[?]4    105.52
Fall-spring matched groups (BQ)                 546.28  106.[?]4    538.37    103.37    538.38  102.0[?]    543.04    102.43    546.22    104.84

Raw-score frequency distributions used to compute scaled-score summary data were obtained from fall new-form total group (N = 7,310). 